PREFACE


Rapid changes in today’s environment emphasize the need for models and meth- 
ods capable of dealing with the uncertainty inherent in virtually all systems re- 
lated to economics, meteorology, demography, ecology, etc. Systems involving 
interactions between man, nature and technology are subject to disturbances 
which may be unlike anything which has been experienced in the past. In 
particular, the technological revolution increases uncertainty as each new stage 
perturbs existing knowledge of structures, limitations and constraints. At the 
same time, many systems are often too complex to allow for precise measure- 
ment of the parameters or the state of the system. Uncertainty, nonstationarity, 
and disequilibrium are pervasive characteristics of most modern systems.

In order to manage such situations (or to survive in such an environment) 
we must develop systems which can facilitate our response to uncertainty and 
changing conditions. In our individual behavior we often follow guidelines that 
are conditioned by the need to be prepared for all (likely) eventualities: insurance,
wearing seat belts, savings versus investments, annual medical check-ups,
even keeping an umbrella at the office, etc. One can identify two major types 
of mechanisms: the short term adaptive adjustments (defensive driving, mar- 
keting, inventory control, etc.) that are made after making some observations 
of the system’s parameters, and the long term anticipative actions (engineer- 
ing design, policy setting, allocation of resources, investment strategies, etc.). 
The main challenge to the system analyst is to develop a modeling approach 
that combines both mechanisms (adaptive and anticipative) in the presence of a 
large number of uncertainties, and this in such a way that it is computationally 
tractable. 

Scenario analysis, the technique most commonly used for long-term planning
under uncertainty, is seriously flawed. Although it can identify “optimal”
solutions for each scenario (that is, for each assignment of values to the
unknown parameters), it gives no indication of how these “optimal” solutions
should be combined to produce even a reasonable decision.

As uncertainty is a broad concept, it is possible, and often useful, to
approach it in many different ways. One rather general approach, which has been
successfully applied to a wide variety of problems, is to assign, explicitly or
implicitly, a probabilistic measure (which can also be interpreted as a measure
of confidence, possibly of a subjective nature) to the various unknown
parameters. This leads us to a class of stochastic optimization problems,
conceivably with only partially known distribution functions (and incomplete
observations of the unknown parameters), called stochastic programming
problems. They can be viewed as extensions of the linear and nonlinear
programming models to decision problems that involve random parameters.

Stochastic programming models were first introduced in the mid-1950s by
Dantzig, Beale, Tintner, and Charnes and Cooper for linear programs with
random coefficients, as models for decision making under uncertainty; Dantzig
even used the name “linear programming under uncertainty”. Nowadays, the term
“stochastic programming” refers to the whole field (models, theoretical
underpinnings and, in particular, solution procedures) that deals with
optimization problems involving random quantities (i.e., with stochastic
optimization problems), the accent being placed on the computational aspects.
In the USSR the term “stochastic programming” has been used to designate not
only various types of stochastic optimization problems but also stochastic
procedures that can be used to solve deterministic nonlinear programming
problems, and that play a particularly important role as solution procedures
for stochastic optimization problems; cf. Chapter 1, Section 9.

Although stochastic programming models were first formulated in the mid 
50’s, rather general formulations of stochastic optimization problems appeared 
much earlier in the literature of mathematical statistics, in particular in the 
theory of sequential analysis and in statistical decision theory. All statistical 
problems such as estimation, prediction, filtering, regression analysis, testing 
of statistical hypotheses, etc., contain elements of stochastic optimization; even 
Bayesian statistical procedures involve loss functions that must be minimized.
Nevertheless, there are differences between the typical formulation of the op- 
timization problems that come from statistics and those from decision making 
under uncertainty. 

Stochastic programming models are mostly motivated by problems arising 
in so-called “here-and-now” situations, when decisions must be made on the 
basis of existing or assumed a priori information about the (relevant) random
quantities, without making additional observations. The situation is typical for
problems of long term planning that arise in operations research and systems 
analysis. In mathematical statistics we are mostly dealing with “wait-and-see” 
situations when we are allowed to make additional observations “during” the 
decision making process. In addition, the accent is often on closed form solu- 
tions, or on ad hoc procedures that can be applied when there are only a few 
decision variables (statistical parameters that need to be estimated). In sto- 
chastic programming, which arose as an extension of linear programming, with 
its sophisticated computational techniques, the accent is on solving problems 
involving a large number of decision variables and random parameters, and
consequently a much larger place is occupied by the search for efficient
solution procedures.

Unfortunately, stochastic optimization problems can very rarely be solved 
by using the standard algorithmic procedures developed for deterministic opti- 
mization problems. To apply these directly would presuppose the availability 
of efficient subroutines for evaluating the multiple integrals of rather involved 
(nondifferentiable) integrands that characterize the system as functions of the
decision variables (objective and constraint functions), and such subroutines
are neither available nor will they become available short of a small upheaval
in (numerical) mathematics. That is why there is presently no software
available that is capable of handling general stochastic optimization problems,
very much for the same reason that there is no universal package for solving
partial differential equations, where one is also confronted by multidimensional
integrations. A number of computer codes have been written to solve certain
specific applications, but it is only now that we can reasonably hope to develop
generally applicable software; generally applicable, that is, within well-defined
classes of stochastic optimization problems. This means that we should be
able to pass from the artisanal to the production level. There are two basic 
reasons for this. First, the available technology (computer technology, numeri- 
cally stable subroutines) has only recently reached a point where the computing 
capabilities match the size of the numerical problems faced in this area. Sec- 
ond, the underlying mathematical theory needed to justify the computational 
shortcuts making the solution of such problems feasible has only recently been 
developed to an implementable level. 

This book is a result of a project on “Numerical Methods for Stochastic 
Optimization Problems” of the Adaptation and Optimization Task of the In- 
ternational Institute for Applied Systems Analysis (IIASA). This project was 
started in 1982. IIASA’s traditional role as a network coordinator between
individual scientists as well as research institutes was a vital component of this
collaborative network of researchers whose interactions contributed significantly 
to the advances made in this field during the last 2-3 years. Let this book serve 
as a testimony to this collaborative effort. 

The book is divided into five parts. Part I is an introduction to some
general and particular stochastic programming problems as models for decision
making under uncertainty. Part II consists of a number of chapters, each
covering some of the numerical questions that must be dealt with when
developing solution procedures for stochastic programming problems. This part is
also meant to provide the background to the description of the implementation
of a number of methods given in Part III. Part IV is a collection of selected
applications and test problems. This volume, together with a tape collecting the
computer codes for stochastic programming problems developed either at IIASA
or at other research institutions that have collaborated in this project,
represents the state of the art of algorithmic development in this field. The main objective of
the IIASA project was to demonstrate that software can be built which solves a 
wide variety of stochastic programming problems. For certain classes of prob- 
lems the software now available is nearly of production-level quality, whereas 
for others only experimental codes have been included. This is a first step in 
software development; it should provide a solid base and serious encouragement 
for more ambitious endeavors in this area.


PART I


Models, Motivation and Methods 


CHAPTER 1 
STOCHASTIC PROGRAMMING, AN INTRODUCTION 
Yu. Ermoliev and R. Wets 


The purpose of this introduction is to discuss the way to deal with uncertain- 
ties in a stochastic optimization framework and to develop this theme in a 
general discussion of modeling alternatives and solution strategies. We shall 
be concerned with motivation and general conceptual questions rather than with
technical details. Almost everything is supposed to happen in finite-dimensional
Euclidean space (decision variables, values of the random elements) and we 
shall assume that all probabilities and expectations, possibly in an extended 
real-valued sense, are well defined. 


1.1 Optimization Under Uncertainty


Many practical problems can be formulated as optimization problems or can 
be reduced to them. Mathematical modeling is concerned with a description of 
various types of relations between the quantities involved in a given situation. 
Sometimes this leads to a unique solution, but more generally it identifies a 
set of possible states, a further criterion being used to choose among them a 
more, or most, desirable state. For example the “states” could be all possible 
structural outlays of a physical system, the preferred state being the one that 
guarantees the highest level of reliability, or an “extremal” state that is chosen 
in terms of a certain desired physical property: dielectric conductivity, sonic
resonance, etc. Applications in operations research, engineering and economics have
focussed attention on situations where the system can be affected or controlled 
by outside decisions that should be selected in the best possible manner. To this 
end, the notion of an optimization problem has proved very useful. We think 
of it in terms of a set S whose elements, called the feasible solutions, represent 
the alternatives open to a decision maker. The aim is to optimize, which we 
take here to mean minimize, over S a certain function $g_0$, the objective function.
The exact definition of S in a particular case depends on various circumstances, 
but it typically involves a number of functional relationships among the vari- 
ables identifying the possible “states”. As prototype for the set S we take the 
following description 


$$S := \{\, x \in R^n \mid x \in X,\ g_i(x) \le 0,\ i = 1, \ldots, m \,\}$$


where X is a given subset of $R^n$ (usually of rather simple character, say $R^n_+$
or possibly $R^n$ itself), and for $i = 1, \ldots, m$, $g_i$ is a real-valued function on $R^n$.


2 Stochastic Optimization Problems 


The optimization problem is then formulated as: 


find $x \in X \subset R^n$
such that $g_i(x) \le 0, \quad i = 1, \ldots, m,$    (1.1)
and $z = g_0(x)$ is minimized.


When dealing with conventional deterministic optimization problems (linear
or nonlinear programs), it is assumed that one has precise information about
the objective function $g_0$ and the constraints $g_i$. In other words, one knows
all the relevant quantities that are necessary for having well-defined functions
$g_i$, $i = 1, \ldots, m$. For example, if this is a production model, enough
information is available about future demands and prices, available inputs and
the coefficients of the input-output relationships, in order to define the cost
function $g_0$ as well as give a sufficiently accurate description of the balance
equations, i.e., the functions $g_i$, $i = 1, \ldots, m$. In practice, however, for
many optimization problems the functions $g_i$, $i = 0, \ldots, m$, are not known
very accurately, and in those cases it is fruitful to think of the functions $g_i$
as depending on a pair of variables $(x, \omega)$, with $\omega$ a vector that takes
its values in a set $\Omega$ of a finite-dimensional Euclidean space. We may
think of $\omega$ as the environment-determining variable that conditions the system
under investigation. A decision $x$ results in different outcomes


$$(g_0(x, \omega), g_1(x, \omega), \ldots, g_m(x, \omega))$$


depending on the uncontrollable factors, i.e. the environment (state of nature, 
parameters, exogenous factors, etc.). In this setting, we face the following 
“optimization” problem: 


find $x \in X \subset R^n$
such that $g_i(x, \omega) \le 0, \quad i = 1, \ldots, m,$    (1.2)
and $z(\omega) = g_0(x, \omega)$ is minimized.


This may suggest a parametric study of the optimal solution as a function of
the environment $\omega$, and this may actually be useful in some cases, but what
we really seek is some $x$ that is “feasible” and that minimizes the objective
for all or for nearly all possible values of $\omega$ in $\Omega$, or in some other
sense that needs to be specified. Any fixed $x \in X$ may be feasible for some
$\omega' \in \Omega$, i.e. satisfy the constraints $g_i(x, \omega') \le 0$ for
$i = 1, \ldots, m$, but infeasible for some other $\omega \in \Omega$. The notion of
feasibility needs to be made precise, and depends very much on the problem at
hand, in particular on whether or not we are able to obtain some information
about the environment, the value of $\omega$, before choosing the decision $x$.
Similarly, what must be understood by optimality depends on the uncertainties
involved as well as on the view one may have of the overall objective(s), e.g.
avoid a disastrous situation, do well in nearly all cases, etc. We cannot
“solve” (1.2) by finding the optimal solution for every possible value of
$\omega$ in $\Omega$, i.e. for every possible environment, aided possibly in this by
parametric analysis. This is the approach advocated by scenario analysis. If the
problem is not insensitive to its environment, then knowing that
$x^1 = x^*(\omega^1)$ is the best decision in environment $\omega^1$ and
$x^2 = x^*(\omega^2)$ is the best decision in environment $\omega^2$ does not really
tell us how to choose some $x$ that will be a reasonably good decision whatever
the environment, $\omega^1$ or $\omega^2$; taking a (convex) combination of $x^1$ and
$x^2$ may lead to a decision that is infeasible for both possibilities: problem
(1.2) with $\omega = \omega^1$ or $\omega = \omega^2$.

In the simplest case of complete information, i.e. when the environment
$\omega$ will be completely known before we have to choose $x$, we should, of
course, simply select the optimal solution of (1.2) by assigning to the
variables $\omega$ the known values of these parameters. However, there may be some
additional restrictions on this choice of $x$ in certain practical situations.
For example, if the problem is highly nonlinear and/or quite large, the search
for an optimal solution may be impractical (too expensive, for example) or even
physically impossible in the available time, the required response time being
too short. Then, even in this case, there arises, in addition to all the usual
questions of optimality, design of solution procedures, convergence, etc., the
question of implementability: namely, how to design a practical (implementable)
decision rule (function)


$$\omega \mapsto x(\omega)$$


which is viable, i.e. $x(\omega)$ is feasible for (1.2) for all $\omega \in \Omega$, and
that is “optimal” in some sense, ideally such that for all $\omega \in \Omega$,
$x(\omega)$ minimizes $g_0(\cdot, \omega)$ on the corresponding set of feasible
solutions. However, since such an ideal decision rule is only rarely simple
enough to be implementable, the notion of optimality must be redefined so as to
make the search for such a decision rule meaningful.

A more typical case is when each observation (information gathering) will
only yield a partial description of the environment $\omega$: it only identifies a
particular collection of possible environments, or a particular probability
distribution on $\Omega$. In such situations, when the value of $\omega$ is not known
in advance, for any choice of $x$ the values assumed by the functions
$g_i(x, \cdot)$, $i = 1, \ldots, m$, cannot be known with certainty. Returning to the
production model mentioned earlier, as long as there is uncertainty about the
demand for the coming month, then for any fixed production level $x$ there will
be uncertainty about the cost (or profit). Suppose we have the very simple
relation between $x$ (production level) and $\omega$ (demand):


$$g_0(x, \omega) = \begin{cases} \alpha (x - \omega) & \text{if } \omega \le x, \\ \beta (\omega - x) & \text{if } \omega \ge x, \end{cases} \qquad (1.3)$$


where $\alpha$ is the unit surplus-cost (holding cost) and $\beta$ is the unit
shortage-cost. The problem would be to find an $x$ that is “optimal” for all
foreseeable demands $\omega$ in $\Omega$, rather than a function
$\omega \mapsto x(\omega)$ which would tell us what the optimal production level
should have been once $\omega$ is actually observed.

When no information is available about the environment $\omega$, except that
$\omega \in \Omega$ (or that it belongs to some subset of $\Omega$), it is possible to
analyze problem (1.2) in terms of the values assumed by the vector


$$(g_0(x, \omega), g_1(x, \omega), \ldots, g_m(x, \omega))$$


as $\omega$ varies in $\Omega$. Let us consider the case when the functions
$g_1, \ldots, g_m$ do not depend on $\omega$. Then we could view (1.2) as a multiple
objective optimization problem. Indeed, we could formulate (1.2) as follows:


find $x \in X \subset R^n$
such that $g_i(x) \le 0, \quad i = 1, \ldots, m,$    (1.4)
and for each $\omega \in \Omega$, $z_\omega = g_0(x, \omega)$ is minimized.


At least if $\Omega$ is a finite set, we may hope that this approach would provide
us with the appropriate concepts of feasibility and optimality. But in fact such
a reformulation does not help much. The most commonly accepted point of view
of optimality in multiple objective optimization is that of Pareto-optimality,
i.e. the solution is such that any change would mean a strictly less desirable
state in terms of at least one of the objectives, here for some $\omega$ in
$\Omega$. Typically, of course, there will be many Pareto-optimal points with no
equivalence between any such solutions. There still remains the question of how
to choose a (unique) decision among the Pareto-optimal points. For instance,
in the case of the objective function defined by (1.3), with
$\Omega = [\underline{\omega}, \bar{\omega}] \subset (0, \infty)$ and $\alpha > 0$,
$\beta > 0$, each $x = \omega$ is Pareto-optimal, see Figure 1.1:


$$g_0(x, \omega) = g_0(\omega, \omega) = 0,$$
$$g_0(\omega, \omega') > 0 \quad \text{for all } \omega' \ne \omega.$$





Figure 1.1 Pareto-optimality


One popular approach to selecting among the Pareto-optimal solutions is to
proceed by “worst-case analysis”. For a given $x$, one calculates the worst that
could happen, in terms of all the objectives, and then chooses a solution that
minimizes the value of the worst-case loss; scenario analysis also relies on a
similar approach. This should single out some point that is optimal in a
pessimistic minimax sense. In the case of the example (1.3), it yields
$x^* = \bar{\omega}$, which suggests a production level sufficiently high to meet
every foreseeable demand. This may turn out to be a quite expensive solution in
the long run!


1.2 Stochastic Optimization: Anticipative Models 


The formulation of problem (1.2) as a stochastic optimization problem
presupposes that in addition to the knowledge of $\Omega$, one can rank the future
alternative environments $\omega$ according to their comparative frequency of
occurrence. In other words, it corresponds to the case when weights (an a priori
probability measure, objective or subjective) can be assigned to all possible
$\omega \in \Omega$, and this is done in a way that is consistent with the calculus
rules for probabilities. Every possible environment $\omega$ becomes an element of
a probability space, and the meaning to assign to feasibility and optimality in
(1.2) can be arrived at by reasonings or statements of a probabilistic nature.
Let us consider the here-and-now situation, when a solution must be chosen that
does not depend on future observations of the environment. In terms of problem
(1.2) it may be some $x \in X$ that satisfies the constraints of (1.2),


$$g_i(x, \omega) \le 0, \quad i = 1, \ldots, m,$$

with a certain level of reliability:

$$\mathrm{prob}\{\omega \mid g_i(x, \omega) \le 0,\ i = 1, \ldots, m\} \ge \alpha \qquad (1.5)$$

where $\alpha \in (0, 1)$, not excluding the possibility $\alpha = 1$, or in the average:

$$E\{g_i(x, \omega)\} \le 0, \quad i = 1, \ldots, m. \qquad (1.6)$$


There are many other possible probabilistic definitions of feasibility involving
not only the mean but also the variance of the random variable $g_i(x, \cdot)$,


$$\mathrm{Var}\, g_i(x, \cdot) = E\{[g_i(x, \omega) - E\{g_i(x, \omega)\}]^2\},$$


such as

$$E\{g_i(x, \omega)\} + \beta\, (\mathrm{Var}\, g_i(x, \cdot))^{1/2} \le 0 \qquad (1.7)$$


for $\beta$ some positive constant, or even higher moments or other nonlinear
functions of the $g_i(x, \cdot)$ may be involved. The same possibilities are
available in defining optimality. Optimality could be expressed in terms of the
(feasible) $x$ that minimizes

$$\mathrm{prob}\{\omega \mid g_0(x, \omega) \ge \alpha_0\} \qquad (1.8)$$


for a prescribed level $\alpha_0$, or the expected value of future cost

$$E\{g_0(x, \omega)\}, \qquad (1.9)$$


and so on.
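
The feasibility notions (1.5), (1.6) and (1.7) can disagree for the same decision. The following Python sketch evaluates all three for a finite set of scenarios. The data (four equally likely demands, the single constraint g(x, ω) = ω − x, and β = 1) and the function names are assumptions made only for this illustration.

```python
import math


def expectation(values, probs):
    """E{g} for a random variable taking finitely many values."""
    return sum(p * v for v, p in zip(values, probs))


def mean_std_lhs(values, probs, beta):
    """Left-hand side of (1.7): E{g} + beta * (Var g)^(1/2)."""
    mean = expectation(values, probs)
    var = expectation([(v - mean) ** 2 for v in values], probs)
    return mean + beta * math.sqrt(var)


def reliability(values, probs):
    """prob{g <= 0}, the quantity bounded from below in (1.5)."""
    return sum(p for v, p in zip(values, probs) if v <= 0)


# Hypothetical data: demand w, production level x, constraint g(x, w) = w - x <= 0.
demands = [80.0, 100.0, 120.0, 140.0]
probs = [0.25, 0.25, 0.25, 0.25]
x = 120.0
g_values = [w - x for w in demands]

mean_g = expectation(g_values, probs)       # (1.6) holds when this is <= 0
lhs_17 = mean_std_lhs(g_values, probs, 1.0) # (1.7) holds when this is <= 0
rel = reliability(g_values, probs)          # (1.5) holds when this is >= alpha
```

With these assumed numbers the decision x = 120 satisfies the constraint on average, violates the mean plus standard deviation version with β = 1, and meets the constraint with reliability 0.75.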

Despite the wide variety of concrete formulations of stochastic optimization
problems generated by problems of the type (1.2), all of them may finally
be reduced to the rather general version given below, and for conceptual and
theoretical purposes it is useful to study stochastic optimization problems in
those general terms. Given a probability space $(\Omega, A, P)$ that gives us a
description of the possible environments $\Omega$ and all possible events $A$ with
associated probability measure $P$, a stochastic programming problem is:


find $x \in X \subset R^n$
such that $F_i(x) = E\{f_i(x, \omega)\} = \int f_i(x, \omega)\, P(d\omega) \le 0$, for $i = 1, \ldots, m$,    (1.10)
and $z = F_0(x) = E\{f_0(x, \omega)\} = \int f_0(x, \omega)\, P(d\omega)$ is minimized,


where X is a (usually closed) fixed subset of $R^n$, and the functions

$$f_i : R^n \times \Omega \to R, \quad i = 1, \ldots, m,$$

and

$$f_0 : R^n \times \Omega \to \bar{R} := R \cup \{-\infty, +\infty\},$$

are such that, at least for every $x$ in $X$, the expectations that appear in (1.10)
are well-defined.

For example, the constraints (1.5), which are called probabilistic or chance
constraints, will be of the above type if we set:

$$f_i(x, \omega) = \begin{cases} \alpha - 1 & \text{if } g_l(x, \omega) \le 0 \text{ for } l = 1, \ldots, m, \\ \alpha & \text{otherwise.} \end{cases} \qquad (1.11)$$

The variance, which appears in (1.7), and other moments are also mathematical
expectations of some nonlinear functions of the $g_i(x, \cdot)$.
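
To see why this choice of integrand reproduces a chance constraint, note that its expectation equals α minus the probability that all constraints hold, so it is nonpositive exactly when that probability is at least α. The short Python sketch below checks this on a finite scenario set; the scenarios, the single constraint g(x, ω) = ω − x and the function names are assumptions for illustration only.

```python
def reliability(x, scenarios, probs, constraints):
    """prob{w | g_l(x, w) <= 0 for every l}, for a finite scenario set."""
    return sum(p for w, p in zip(scenarios, probs)
               if all(g(x, w) <= 0 for g in constraints))


def chance_constraint_value(x, scenarios, probs, constraints, alpha):
    """E{f_i(x, w)} with f_i = alpha - 1 if all g_l(x, w) <= 0, alpha otherwise."""
    total = 0.0
    for w, p in zip(scenarios, probs):
        satisfied = all(g(x, w) <= 0 for g in constraints)
        total += p * ((alpha - 1.0) if satisfied else alpha)
    return total


# Hypothetical data: four equally likely demands, one constraint "demand <= production".
scenarios = [80.0, 100.0, 120.0, 140.0]
probs = [0.25, 0.25, 0.25, 0.25]
constraints = [lambda x, w: w - x]

x = 120.0
rel = reliability(x, scenarios, probs, constraints)
low = chance_constraint_value(x, scenarios, probs, constraints, 0.7)   # equals 0.7 - rel
high = chance_constraint_value(x, scenarios, probs, constraints, 0.8)  # equals 0.8 - rel
```

Here the reliability of x = 120 is 0.75, so the expectation is negative for α = 0.7 (constraint satisfied) and positive for α = 0.8 (constraint violated).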

How one actually passes from (1.2) to (1.10) depends very much on the
concrete situation at hand. For example, the criterion (1.8) and the constraints
(1.5) are obtained if one classifies the possible outcomes


$$g_0(x, \omega), g_1(x, \omega), \ldots, g_m(x, \omega),$$

as $\omega$ varies in $\Omega$, into “bad” and “good” (or acceptable and
nonacceptable). To minimize (1.8) is equivalent to minimizing the probability of
a “bad” event. The choice of the level $\alpha$ as it appears in (1.5) is a problem
in itself, unless such a constraint is introduced to satisfy contractually
specified reliability levels. The natural tendency is to choose the reliability
level $\alpha$ as high as possible, but this may result in a rapid increase in the
overall cost. Figure 1.2 illustrates a typical situation where increasing the
reliability level beyond a certain level may result in enormous additional costs.


Figure 1.2 Reliability versus cost.


To analyze how high one should go in the setting of reliability levels, one should,
ideally, introduce the loss that would be incurred if the constraints were
violated, to be balanced against the value of the objective function. Suppose
the objective function is of type (1.9). In the simple case when violating the
constraint $g_i(x, \omega) \le 0$ generates a cost


$$q_i \cdot g_i(x, \omega), \quad (q_i \ge 0)$$


proportional to the amount by which we violate the constraint, we are led to
the objective function:


$$f_0(x, \omega) = g_0(x, \omega) + \sum_{i=1}^{m} q_i \max[0, g_i(x, \omega)], \qquad (1.12)$$

for the stochastic optimization problem (1.10). For the production (inventory)
model with cost function given by (1.3), it would be natural to minimize the
expected loss function


$$F_0(x) = \alpha \int_{\omega \le x} (x - \omega)\, P(d\omega) + \beta \int_{\omega \ge x} (\omega - x)\, P(d\omega) = E\{g_0(x, \omega)\}$$


which we can also write as

$$F_0(x) = E\{\max[\alpha (x - \omega), \beta (\omega - x)]\}. \qquad (1.13)$$
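
A minimal Python sketch of how (1.13) can be evaluated and minimized when the demand takes finitely many values. The data (four equally likely demands, α = 1, β = 4) and the function names are assumptions for illustration. Since the expected loss is then convex and piecewise linear with kinks only at the demand values, it is enough to compare it at those values.

```python
def expected_loss(x, demands, probs, alpha, beta):
    """F_0(x) = E{max[alpha*(x - w), beta*(w - x)]} for finitely many demands."""
    return sum(p * max(alpha * (x - w), beta * (w - x))
               for w, p in zip(demands, probs))


def best_production_level(demands, probs, alpha, beta):
    """Return (minimal expected loss, minimizing x), searching the demand values only."""
    return min((expected_loss(w, demands, probs, alpha, beta), w) for w in demands)


# Hypothetical data, chosen only for illustration.
demands = [80.0, 100.0, 120.0, 140.0]
probs = [0.25, 0.25, 0.25, 0.25]
alpha, beta = 1.0, 4.0   # unit surplus cost, unit shortage cost

f_star, x_star = best_production_level(demands, probs, alpha, beta)
mean_demand = sum(p * w for w, p in zip(demands, probs))
f_mean = expected_loss(mean_demand, demands, probs, alpha, beta)
```

With these assumed numbers the minimizer is x = 140 with expected loss 30, whereas producing the mean demand 110 gives an expected loss of 50.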


A more general class of problems of this latter type comes with the objective
function:

$$F_0(x) = E\{\max_{y \in Y} \varphi(x, y, \omega)\} \qquad (1.14)$$


where $Y$ is a subset of a finite-dimensional Euclidean space. Such a problem can
be viewed as a model for decision making under uncertainty, where the $x$ are
the decision variables themselves, the $\omega$ variables correspond to the states
of nature with given probability measure $P$, and the $y$ variables are there to
take into account the worst case.


1.3 About Solution Procedures


In the design of solution procedures for stochastic optimization problems of
type (1.10), one must come to grips with two major difficulties that are usually
brushed aside in the design of solution procedures for the more conventional
nonlinear optimization problems (1.1): in general, the exact evaluation of the
functions $F_i$, $i = 1, \ldots, m$, (or of their gradients, etc.) is out of the
question, and moreover, these functions are quite often nondifferentiable. In
principle, any nonlinear programming technique developed for solving problems of
type (1.1) could be used for solving stochastic optimization problems. Problems
of type (1.10) are after all just a special case of (1.1), and this does also
work well in practice if it is possible to obtain explicit expressions for the
functions $F_i$, $i = 1, \ldots, m$, through the analytical evaluation of the
corresponding integrals


$$F_i(x) = E\{f_i(x, \omega)\} = \int f_i(x, \omega)\, P(d\omega).$$


Unfortunately, the exact evaluation of these integrals, either analytically or
numerically by relying on existing software for quadratures, is only possible in
exceptional cases, for very special types of probability measures $P$ and
integrands $f_i(x, \cdot)$. For example, to calculate the values of the constraint
function (1.5) even for $m = 1$, and


$$g_1(x, \omega) = h(\omega) - \sum_{j=1}^{n} t_j(\omega) x_j, \qquad (1.15)$$


with random parameters $h(\cdot)$ and $t_j(\cdot)$, it is necessary to find the
probability of the event


$$\Big\{\omega \,\Big|\, \sum_{j=1}^{n} t_j(\omega) x_j \ge h(\omega)\Big\}$$


as a function of $x = (x_1, \ldots, x_n)$. Finding an analytical expression for this
function is only possible in a few rare cases; the distribution of the random
variable


$$\omega \mapsto h(\omega) - \sum_{j=1}^{n} t_j(\omega) x_j$$


may depend dramatically on $x$; compare $x = (0, \ldots, 0)$ and $x = (1, \ldots, 1)$.

Of course, the exact evaluation of the functions $F_i$ is certainly not possible
if only partial information is available about $P$, or if information will only
become available while the problem is being solved, as is the case in optimization
systems in which the values of the outputs $\{f_i(x, \omega),\ i = 0, \ldots, m\}$ are
obtained through actual measurements or Monte Carlo simulations.

In order to bypass some of the numerical difficulties encountered with
multiple integrals in the stochastic optimization problem (1.10), one may be
tempted to solve a substitute problem obtained from (1.2) by replacing the
parameters by their expected values, i.e. in (1.10) we replace


$$E\{f_i(x, \omega)\} \quad \text{by} \quad f_i(x, \bar{\omega}),$$


where $\bar{\omega} = E\{\omega\}$. This is relatively often done in practice, and
sometimes the optimal solution might only be slightly affected by such a crude
approximation. Unfortunately, this supposedly harmless simplification may suggest
decisions that not only are far from being optimal, but may even “validate” a
course of action that is contrary to the best interests of the decision maker. As
a simple example of the errors that may derive from such a substitution, let us
consider:


$$f_0(x, \omega) = (\omega x)^2, \quad x \in R, \quad P[\omega = +1] = P[\omega = -1] = \tfrac{1}{2};$$


then

$$f_0(x, \bar{\omega}) = 0, \quad \text{but} \quad E\{f_0(x, \omega)\} = x^2.$$
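
The gap between the two quantities in this example is easy to confirm numerically. The Python sketch below does so for one value of x; the helper names and the test point x = 3 are assumptions made for the illustration.

```python
def f0(x, w):
    """The integrand of the example: f_0(x, w) = (w * x)^2."""
    return (w * x) ** 2


# Two equally likely outcomes, w = +1 and w = -1.
outcomes = [(+1.0, 0.5), (-1.0, 0.5)]


def expected_value(fn, x):
    """E{fn(x, w)} over the two-point distribution above."""
    return sum(p * fn(x, w) for w, p in outcomes)


w_bar = sum(p * w for w, p in outcomes)   # expected value of w, here 0

x = 3.0
surrogate = f0(x, w_bar)                  # value of the substitute objective
true_value = expected_value(f0, x)        # value of the true objective, x^2
```

The substitute objective is identically zero, so it cannot distinguish between decisions, while the true expected cost grows as x squared.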


Not having access to precise evaluation of the function values, or the
gradients of the $F_i$, $i = 0, \ldots, m$, is the main obstacle to be overcome in
the design of algorithmic procedures for stochastic optimization problems.
Another peculiarity of this type of problem is that the functions


$$x \mapsto F_i(x), \quad i = 0, \ldots, m,$$

are quite often nondifferentiable (see for example (1.5), (1.7), (1.8), (1.13)
and (1.14)); they may even be discontinuous, as indicated by the simple example
in Figure 1.3.





Figure 1.3 $F_0(x) = P\{\omega \mid \omega x \le 1\}$, $P[\omega = +1] = P[\omega = -1] = \tfrac{1}{2}$
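
The function in Figure 1.3 can be tabulated exactly with a few lines of Python. This is only a sketch; the function name and the sample points are chosen for illustration.

```python
from fractions import Fraction

# w = +1 or w = -1, each with probability 1/2.
OUTCOMES = [(1, Fraction(1, 2)), (-1, Fraction(1, 2))]


def F0(x):
    """P{w | w * x <= 1} for the two-point distribution above."""
    return sum(p for w, p in OUTCOMES if w * x <= 1)


sample_points = [-2, -1, 0, 1, Fraction(11, 10), 2]
values = {x: F0(x) for x in sample_points}
```

The value is 1 on the interval from -1 to +1 and drops to 1/2 immediately outside it, which is the jump the figure refers to.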
The stochastic version of even the simplest linear problem may lead to a
nondifferentiable problem, as vividly demonstrated by Figure 1.3. It is now easy
to imagine how complicated similar functions defined by linear inequalities in
$R^n$ might become. As another example of this type, let us consider a constraint
of the type (1.5), i.e. a probabilistic constraint, where the $g_i(\cdot, \omega)$
are linear and involve only one 1-dimensional random variable $h(\cdot)$. The set
$S$ of feasible solutions consists of those $x$ that satisfy


$$P\{\omega \mid x + 3 \ge h(\omega),\ x \le h(\omega)\} \ge \tfrac{2}{3},$$

where $h(\cdot)$ is equal to 0, 2, or 4, each with probability $\tfrac{1}{3}$. Then

$$S = [-1, 0] \cup [1, 2]$$


is disconnected.
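
The disconnected feasible set can be checked by brute force. The Python sketch below assumes the reliability level 2/3 and the three equally likely values of h stated above, and scans a grid of candidate decisions; the grid and the function names are illustrative assumptions.

```python
from fractions import Fraction

H_VALUES = [0, 2, 4]          # possible values of h
P_EACH = Fraction(1, 3)       # probability of each value


def reliability(x):
    """P{w | x + 3 >= h(w) and x <= h(w)}."""
    return sum(P_EACH for h in H_VALUES if x + 3 >= h and x <= h)


def is_feasible(x, level=Fraction(2, 3)):
    """True when x satisfies the probabilistic constraint at the given level."""
    return reliability(x) >= level


# Grid of step 1/4 on [-4, 5]; exact rational arithmetic avoids rounding issues.
grid = [Fraction(k, 4) for k in range(-16, 21)]
feasible = [x for x in grid if is_feasible(x)]
```

On this grid the feasible points are exactly those lying in [-1, 0] or in [1, 2]; the points strictly between 0 and 1 only reach reliability 1/3.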

The situation is not always that hopeless; in fact, for well-formulated
stochastic optimization problems we may expect a lot of regularity, such as
convexity of the feasibility region, convexity and/or Lipschitz properties of the
objective function, and so on. This is well documented in the literature.

In the next two sections, we introduce some of the most important formula- 
tions of stochastic programming problems and show that for the development of 
conceptual algorithms, problem (1.10) may serve as a guide, in that the difficul- 
ties to be encountered in solving very specific problems are of the same nature 
as those one would have when dealing with the quite general model (1.10). 


1.4 Stochastic Optimization: Adaptive Models 


In the stochastic optimization model (1.10), the decision $x$ has to be chosen by
using an a priori probabilistic measure $P$ without having the opportunity of
making additional observations. As discussed earlier, this corresponds to the
idea of an optimization model as a tool for planning for possible future
environments; that is why we used the term anticipative optimization. Consider
now the situation when we are allowed to make an observation before choosing
$x$. This corresponds to the idea of optimization in a learning environment;
let us call it adaptive optimization.

Typically, observations will only give a partial description of the
environment $\omega$. Suppose $B$ is a collection of sets that contains all the
relevant information that could become available after making an observation; we
think of $B$ as a subset of $A$. The decision $x$ must be determined on the basis
of the information available in $B$, i.e. it must be a function of $\omega$ whose
values are $B$-dependent or, equivalently, is “$B$-measurable”. The statement of
the corresponding optimization problem is similar to (1.10), except that now we
allow a larger class of solutions, the $B$-measurable functions, instead of just
points in $R^n$ (which in this setting would just correspond to the constant
functions on $\Omega$). The problem is to find a $B$-measurable function


$$\omega \mapsto x(\omega)$$

that satisfies: $x(\omega) \in X$ for all $\omega$,

$$E\{f_i(x(\cdot), \cdot) \mid B\}(\omega) \le 0, \quad i = 1, \ldots, m,$$


and

$$z = E\{f_0(x(\omega), \omega)\} \text{ is minimized,} \qquad (1.16)$$


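where $E\{\cdot \mid \mathcal{B}\}$ denotes the conditional expectation given $\mathcal{B}$. Since $x$ is to
be a $\mathcal{B}$-measurable function, the search for the optimal $x$ can be reduced to
finding, for each $\omega \in \Omega$, the solution of

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{such that} & E\{f_i(x,\cdot) \mid \mathcal{B}\}(\omega) \le 0, \quad i = 1,\dots,m, \\
\text{and} & z_\omega = E\{f_0(x,\cdot) \mid \mathcal{B}\}(\omega) \ \text{is minimized.}
\end{array} \tag{1.17}
$$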


Each problem of this type has exactly the same features as problem (1.10),
except that expectation has been replaced by conditional expectation; note that
problem (1.17) will be the same for all $\omega$ that belong to the same elementary
event of $\mathcal{B}$. In the case when $\omega$ becomes completely known, i.e. when
$\mathcal{B} = \mathcal{A}$, the optimal $\omega \mapsto x(\omega)$ is obtained by solving, for all $\omega$, the
optimization problem

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{such that} & f_i(x,\omega) \le 0, \quad i = 1,\dots,m, \\
\text{and} & z_\omega = f_0(x,\omega) \ \text{is minimized,}
\end{array} \tag{1.18}
$$

i.e. we need to make a parametric analysis of the optimal solution as a function
of $\omega$.
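
The decomposition of (1.16) into one problem of type (1.17) per elementary event of $\mathcal{B}$ can be sketched numerically. The following Python fragment is only an illustration: the four-point distribution, the two-event partition, the quadratic loss `f0` and the integer candidate grid are assumptions made for the example and do not come from the text.

```python
# Hypothetical data: four equally likely environments that are observed only
# through a partition of Omega into two elementary events of B.
partition = [
    [(0.25, 1.0), (0.25, 3.0)],   # event 1: (probability, omega) pairs
    [(0.25, 7.0), (0.25, 9.0)],   # event 2
]

def f0(x, w):
    """Assumed loss: squared distance between decision and environment."""
    return (x - w) ** 2

def conditional_solution(event, candidates):
    """Minimize the conditional expectation E{f0(x, .) | event} over candidates."""
    mass = sum(p for p, _ in event)

    def cond_exp(x):
        return sum(p * f0(x, w) for p, w in event) / mass

    return min(candidates, key=cond_exp)

candidates = range(0, 11)
# One decision per elementary event: this list is the B-measurable rule.
rule = [conditional_solution(event, candidates) for event in partition]
print(rule)
```

The resulting rule is constant on each event, which is exactly what $\mathcal{B}$-measurability requires in this finite setting.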

If the optimal decision rule $\omega \mapsto x^*(\omega)$ obtained by solving (1.16) is
implementable in a real-life setting, it may be important to know the
distribution function of the optimal value

$$
\omega \mapsto E\{f_0(x^*(\cdot),\cdot) \mid \mathcal{B}\}(\omega).
$$

This is known as the distribution problem for random mathematical programs,
which has received a lot of attention in the literature, in particular in the case
when the functions $f_i$, $i = 0,\dots,m$, are linear and $\mathcal{B} = \mathcal{A}$; references can be
found in Part V of this volume, consult the section on the distribution problem.

Unfortunately, in general the decision rule $x^*(\cdot)$ obtained by solving (1.17),
and in particular (1.18), is much too complicated for practical use. For example,
in our production model with uncertain demand, the resulting output may lead
to highly irregular transportation requirements, etc. In inventory control, one
has recourse to “simple” $(s,S)$-policies in order to avoid the possible chaotic
behavior of more “optimal” procedures; an $(s,S)$-policy is one in which an order
is placed as soon as the stock falls below a buffer level $s$, and the quantity
ordered restores the available stock to a level $S$. In this case, we are
restricted to a specific family of decision rules, defined by two parameters $s$
and $S$ which have to be fixed before any observation is made.
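
A minimal Python sketch of an $(s,S)$-policy, as described above, may help fix ideas. The function name `simulate_sS`, the integer demands, the starting stock and the convention that negative stock represents backlog are all assumptions introduced for the illustration.

```python
import random

def simulate_sS(s, S, demands, start_stock):
    """Apply an (s, S)-policy to a demand sequence and return the orders placed."""
    stock = start_stock
    orders = []
    for d in demands:
        # Order up to level S as soon as the stock has fallen below the buffer s.
        q = S - stock if stock < s else 0
        orders.append(q)
        # Assumption: unmet demand is backlogged, so the stock may go negative.
        stock = stock + q - d
    return orders

rng = random.Random(0)                      # fixed seed, hypothetical demand data
demands = [rng.randint(0, 10) for _ in range(12)]
orders = simulate_sS(s=5, S=20, demands=demands, start_stock=20)
print(orders)
```

Whatever the demand sequence, the whole rule is determined by the two numbers `s` and `S`, which is the point made in the text.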

More generally, we very often require the decision rules $\omega \mapsto x(\omega)$ to belong
to a prescribed family

$$
\{x(\lambda,\cdot),\ \lambda \in \Lambda\}
$$

of decision rules parametrized by a vector $\lambda$, and it is this $\lambda$ that must be
chosen here-and-now before any observations are made. Assuming that the members
of this family are $\mathcal{B}$-measurable, and substituting $x(\lambda,\cdot)$ in (1.16), we are led
to the following optimization problem:

$$
\begin{array}{ll}
\text{find} & \lambda \in \Lambda \\
\text{such that} & x(\lambda,\omega) \in X \ \text{for all } \omega \in \Omega, \\
 & H_i(\lambda) = E\{f_i(x(\lambda,\cdot),\cdot)\} \le 0, \quad i = 1,\dots,m, \\
\text{and} & H_0(\lambda) = E\{f_0(x(\lambda,\omega),\omega)\} \ \text{is minimized.}
\end{array} \tag{1.19}
$$

This again is a problem of type (1.10), except that now the minimization is with
respect to $\lambda$. Therefore, by introducing the family of decision rules
$\{x(\lambda,\cdot),\ \lambda \in \Lambda\}$ we have reduced the problem of adaptive optimization to a
problem of anticipatory optimization: no observations are made before fixing the
values of the parameters $\lambda$.
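
To illustrate the reduction to a problem in $\lambda$, the sketch below treats the pair $\lambda = (s, S)$ of an inventory policy as the here-and-now decision and estimates $H_0(\lambda)$ by averaging over sampled demand scenarios. The cost coefficients, the demand distribution and the candidate grid are hypothetical, and the grid search merely stands in for a proper optimization method.

```python
import random

def cost_sS(s, S, demands, hold=1.0, short=5.0, fixed=10.0):
    """Cost of running an (s, S)-policy on one demand scenario (assumed cost model)."""
    stock, total = S, 0.0
    for d in demands:
        if stock < s:              # reorder up to S, paying a fixed ordering cost
            total += fixed
            stock = S
        stock -= d
        total += hold * max(stock, 0) + short * max(-stock, 0)
    return total

def H0(s, S, scenarios):
    """Sample-average estimate of the expected cost for parameters (s, S)."""
    return sum(cost_sS(s, S, sc) for sc in scenarios) / len(scenarios)

rng = random.Random(1)
scenarios = [[rng.randint(0, 10) for _ in range(20)] for _ in range(200)]

# lambda = (s, S) is fixed before any demand is observed.
candidates = [(s, S) for S in range(5, 41, 5) for s in range(0, S + 1, 5)]
best = min(candidates, key=lambda lam: H0(lam[0], lam[1], scenarios))
print(best, H0(best[0], best[1], scenarios))
```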

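It should be noted that the family $\{x(\lambda,\cdot),\ \lambda \in \Lambda\}$ may be given implicitly.
To illustrate this let us consider a problem studied by Tintner. We start with
the linear programming problem (1.20), a version of (1.2):

$$
\begin{array}{ll}
\text{find} & x \in R^n_+ \\
\text{such that} & \sum_{j=1}^{n} a_{ij}(\omega)\, x_j \ge b_i(\omega), \quad i = 1,\dots,m, \\
\text{and} & z = \sum_{j=1}^{n} c_j(\omega)\, x_j \ \text{is minimized,}
\end{array} \tag{1.20}
$$

where the $a_{ij}(\cdot)$, $b_i(\cdot)$ and $c_j(\cdot)$ are positive random variables. Consider the
family of decision rules: let $\lambda_{ij}$ be the portion of the $i$-th resource to be
assigned to activity $j$, thus

$$
\sum_{j=1}^{n} \lambda_{ij} = 1, \quad \lambda_{ij} \ge 0 \quad \text{for } i = 1,\dots,m;\ j = 1,\dots,n, \tag{1.21}
$$

and for $j = 1,\dots,n$,

$$
x_j(\lambda,\omega) \in \operatorname*{argmin}_{x \in R_+} \{\, c_j(\omega)\, x \mid a_{ij}(\omega)\, x \ge \lambda_{ij} b_i(\omega),\ i = 1,\dots,m \,\},
$$

i.e.

$$
x_j(\lambda,\omega) = \max_{1 \le i \le m} \lambda_{ij}\, b_i(\omega) / a_{ij}(\omega).
$$

This decision rule is only as good as the $\lambda_{ij}$ that determine it. The optimal
$\lambda$'s are found by minimizing

$$
\sum_{j=1}^{n} E\Big\{ c_j(\omega) \max_{1 \le i \le m} \lambda_{ij}\, b_i(\omega) / a_{ij}(\omega) \Big\}
$$

subject to (1.21), again a problem of type (1.10).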
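
The implicitly defined rule can be evaluated directly. The Python sketch below computes $x_j(\lambda,\omega)$ as the maximum over $i$ of $\lambda_{ij} b_i(\omega)/a_{ij}(\omega)$ and averages the resulting cost over a finite list of scenarios; the two scenarios and the equal-split allocation `lam` are made-up data used only for illustration.

```python
def tintner_rule(lam, a, b):
    """x_j(lam, w) = max_i lam[i][j] * b[i] / a[i][j] for one realization (a, b)."""
    m, n = len(a), len(a[0])
    return [max(lam[i][j] * b[i] / a[i][j] for i in range(m)) for j in range(n)]

def expected_cost(lam, scenarios):
    """Sum over scenarios of probability * c(w) . x(lam, w)."""
    total = 0.0
    for p, a, b, c in scenarios:
        x = tintner_rule(lam, a, b)
        total += p * sum(cj * xj for cj, xj in zip(c, x))
    return total

# Hypothetical data: m = 2 resources, n = 2 activities, two scenarios.
lam = [[0.5, 0.5], [0.5, 0.5]]            # each row sums to 1, as in (1.21)
scenarios = [
    (0.5, [[1.0, 2.0], [2.0, 1.0]], [4.0, 6.0], [1.0, 1.0]),
    (0.5, [[2.0, 2.0], [1.0, 4.0]], [8.0, 4.0], [1.0, 2.0]),
]

for p, a, b, c in scenarios:
    x = tintner_rule(lam, a, b)
    # The rule is feasible for (1.20) because the rows of lam sum to 1.
    assert all(sum(a[i][j] * x[j] for j in range(2)) >= b[i] for i in range(2))

print(expected_cost(lam, scenarios))
```

Minimizing `expected_cost` over all `lam` satisfying (1.21) would then be the problem of type (1.10) referred to in the text; the sketch only evaluates the objective at one point.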


1.5 Anticipation and Adaptation: Recourse Models 

The (two-stage) recourse problem can be viewed as an attempt to incorporate
both fundamental mechanisms of anticipation and adaptation within a single
mathematical model. In other words, this model reflects a trade-off between
long-term anticipatory strategies and the associated short-term adaptive
adjustments. For example, there might be a trade-off between a road investment
program and the running costs for the transportation fleet, or between
investments in facilities location and the profit from day-to-day operation.
The linear version of the recourse problem is formulated as follows:

$$
\begin{array}{ll}
\text{find} & x \in R^n_+ \\
\text{such that} & F_i(x) = b_i - A_i x \le 0, \quad i = 1,\dots,m, \\
\text{and} & F_0(x) = cx + E\{Q(x,\omega)\} \ \text{is minimized,}
\end{array} \tag{1.22}
$$

where

$$
Q(x,\omega) = \inf_{y \in R^{n'}_+} \{\, q(\omega)\, y \mid W(\omega)\, y = h(\omega) - T(\omega)\, x \,\}; \tag{1.23}
$$

some or all of the coefficients of the matrices and vectors $q(\cdot)$, $W(\cdot)$, $h(\cdot)$ and
$T(\cdot)$ may be random variables. In this problem, the long-term decision is made
before any observation of $\omega \sim (q(\omega), W(\omega), h(\omega), T(\omega))$. After the true
environment is observed, the discrepancies that may exist between $h(\omega)$ and
$T(\omega)x$ (for fixed $x$ and observed $h(\omega)$ and $T(\omega)$) are corrected by choosing a
recourse action $y$, so that

$$
W(\omega)\, y = h(\omega) - T(\omega)\, x, \quad y \ge 0, \tag{1.24}
$$

that minimizes the loss

$$
q(\omega)\, y.
$$

Therefore, an optimal decision $x$ should minimize the total cost of carrying out
the overall plan: direct costs as well as the costs generated by the need to take
corrective (adaptive) action.
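
A one-dimensional special case makes the structure of (1.22)-(1.23) concrete. In the sketch below, which is an assumed example rather than a model from the text, the recourse matrix is $W = [1, -1]$ and $T = 1$, so that the second-stage problem has the closed form $Q(x,h) = q^+ \max(h - x, 0) + q^- \max(x - h, 0)$. The names `q_plus`, `q_minus` and the three scenarios are hypothetical.

```python
def Q(x, h, q_plus, q_minus):
    """Recourse cost for W = [1, -1], T = 1: pay q_plus per unit of shortage
    (h > x) and q_minus per unit of surplus (h < x)."""
    gap = h - x
    return q_plus * gap if gap >= 0 else q_minus * (-gap)

def F0(x, c, scenarios, q_plus, q_minus):
    """First-stage cost plus expected recourse cost, as in (1.22)."""
    return c * x + sum(p * Q(x, h, q_plus, q_minus) for p, h in scenarios)

# Hypothetical discrete distribution of the right-hand side h(w).
scenarios = [(0.25, 2.0), (0.5, 4.0), (0.25, 8.0)]
grid = [k * 0.5 for k in range(0, 21)]      # candidate decisions 0, 0.5, ..., 10

x_best = min(grid, key=lambda x: F0(x, 1.0, scenarios, 3.0, 0.5))
print(x_best, F0(x_best, 1.0, scenarios, 3.0, 0.5))
```

The minimizer balances the direct cost `c * x` against the expected cost of correcting shortages and surpluses once `h` is known.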

A more general model is formulated as follows. A long-term decision $x$
must be made before the observation of $\omega$ is available. For given $x \in X$ and
observed $\omega$, the recourse (feedback) action $y(x,\omega)$ is chosen so as to solve the
problem

$$
\begin{array}{ll}
\text{find} & y \in Y \subset R^{n'} \\
\text{such that} & f_{2i}(x,y,\omega) \le 0, \quad i = 1,\dots,m', \\
\text{and} & z_2 = f_{20}(x,y,\omega) \ \text{is minimized,}
\end{array} \tag{1.25}
$$

assuming that for each $x \in X$ and $\omega \in \Omega$ the set of feasible solutions of this
problem is nonempty (in technical terms, this is known as relatively complete
recourse). Then to find the optimal $x$, one would solve a problem of the type:

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{such that} & F_0(x) = E\{f_{20}(x, y(x,\omega), \omega)\} \ \text{is minimized.}
\end{array} \tag{1.26}
$$

If the state of the environment $\omega$ remains unknown or partially unknown after
observation, then

$$
\omega \mapsto y(x,\omega)
$$

is defined as the solution of an adaptive model of the type discussed in Section
1.4. Given $\mathcal{B}$, the field of possible observations, the problem to be solved for
finding $y(x,\omega)$ becomes: for each $\omega \in \Omega$

$$
\begin{array}{ll}
\text{find} & y \in Y \subset R^{n'} \\
\text{such that} & E\{f_{2i}(x,y,\cdot) \mid \mathcal{B}\}(\omega) \le 0, \quad i = 1,\dots,m', \\
\text{and} & z_{2\omega} = E\{f_{20}(x,y,\cdot) \mid \mathcal{B}\}(\omega) \ \text{is minimized.}
\end{array} \tag{1.27}
$$

If $\omega \mapsto y(x,\omega)$ yields the optimal solution of this collection of problems, then
to find an optimal $x$ we again have to solve a problem of type (1.26).
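Let us notice that if

$$
f_{20}(x,y,\omega) = cx + q(\omega)\, y
$$

and for $i = 1,\dots,m'$,

$$
f_{2i}(x,y,\omega) =
\begin{cases}
\alpha_i - 1 & \text{if } T_i(\omega)\, x + W_i(\omega)\, y - h_i(\omega) \ge 0, \\
\alpha_i & \text{otherwise,}
\end{cases}
$$

then $E\{f_{2i}(x,y,\cdot) \mid \mathcal{B}\} \le 0$ holds exactly when the conditional probability of
$T_i(\omega)x + W_i(\omega)y - h_i(\omega) \ge 0$ is at least $\alpha_i$, so that (1.26), with the second
stage problem as defined by (1.27), corresponds to the statement of the recourse
problem in terms of conditional probabilistic (chance) constraints.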

There are many variants of the basic recourse models (1.22) and (1.26).
There may be, in addition to the deterministic constraints on $x$, some expectation
constraints such as (1.7), or the recourse decision rule may be subject to various
restrictions such as those discussed in Section 1.4, etc. In any case, as is clear
from the formulation, these problems are of the general type (1.10), albeit with a
rather complicated function $f_0(x,\omega)$.

1.6 Dynamic Aspects: Multistage Recourse Problems


It should be emphasized that the “stages” of a two-stage recourse problem do
not necessarily refer to time units. They correspond to steps in the decision
process: $x$ may be a here-and-now decision, whereas the $y$ correspond to all
future actions to be taken in different time periods in response to the
environment created by the chosen $x$ and the observed $\omega$ in that specific time
period. In another instance, the $x, y$ solutions may represent sequences of
control actions over a given time horizon,

$$
x = (x(0), x(1), \dots, x(T)), \qquad y = (y(0), y(1), \dots, y(T)),
$$

the $y$-decisions being used to correct for the basic trend set by the $x$-control
variables. As a special case we have

$$
x = (x(0), x(1), \dots, x(s)), \qquad y = (y(s+1), \dots, y(T)),
$$

that corresponds to a mid-course maneuver at time $s$ when some observations
have become available to the controller. We speak of two-stage dynamic models.
In what follows, we discuss in more detail the possible statements of such
problems.

In the case of dynamical systems, in addition to the $x, y$ solutions of
problems (1.26)-(1.25), there may also be an additional group of variables

$$
z = (z(0), z(1), \dots, z(T))
$$

that record the state of the system at times $0, 1, \dots, T$. Usually, the variables
$x, y, z, \omega$ are connected through a (differential) system of equations of the type:

$$
\Delta z(t) = h[t, z(t), x(t), y(t), \omega], \quad t = 0,\dots,T-1, \tag{1.28}
$$

where

$$
\Delta z(t) = z(t+1) - z(t), \quad z(0) = z_0,
$$

or they are related by an implicit function of the type:

$$
h[t, z(t+1), z(t), x(t), y(t), \omega] = 0, \quad t = 0,\dots,T-1. \tag{1.29}
$$

The latter is the typical form one finds in operations research models,
economics and system analysis; the first one, (1.28), is the conventional one
in the theory of optimal control and its applications in engineering, inventory
control, etc. In the implicit formulation (1.29) an additional computational
problem arises from the fact that it is necessary to solve a large system of
linear or nonlinear equations in order to obtain a description of the evolution
of the system.

The objective and constraint functions of stochastic dynamic problems are
generally expressed in terms of mathematical expectations of functions that we
take to be:

$$
g_i(z(0), x(0), y(0), \dots, z(T), x(T), y(T)), \quad i = 0, 1, \dots, m. \tag{1.30}
$$

If no observations are allowed, then equations (1.28), or (1.29), and (1.30) do
not depend on $y$, and we have the following one-stage problem

$$
\begin{array}{ll}
\text{find} & x = (x(0), x(1), \dots, x(T)) \\
\text{such that} & x(t) \in X(t) \subset R^n, \quad t = 0,\dots,T, \\
 & \Delta z(t) = h[t, z(t), x(t), \omega], \quad t = 0,\dots,T-1, \\
 & E\{g_i(z(0), x(0), \dots, z(T), x(T), \omega)\} \le 0, \quad i = 1,\dots,m, \\
\text{and} & v = E\{g_0(z(0), x(0), \dots, z(T), x(T), \omega)\} \ \text{is minimized,}
\end{array} \tag{1.31}
$$

or with the dynamics given by (1.29). Since in (1.28) or (1.29) the variables
$z(t)$ are functions of $(x,\omega)$, the functions $g_i$ are also implicit functions of
$(x,\omega)$, i.e. we can rewrite problem (1.31) in terms of functions

$$
f_i(x,\omega) = g_i(z(x,\omega), x, \omega);
$$

the stochastic dynamic problem (1.31) is then reduced to a stochastic
optimization problem of type (1.10). The implicit form of the objective and the
constraints of this problem requires a special calculus for evaluating these
functions and their derivatives, but it does not alter the general solution
strategies for stochastic programming problems.

The two-stage recourse model allows for a recourse decision $y$ that is based
on (the first stage decision $x$ and) the result of observations. The following
simple example should be useful in the development of a dynamical version of
that model. Suppose we are interested in the design of an optimal trajectory
to be followed, in the future, by a number of systems that have a variety of
(dynamical) characteristics. For instance, we are interested in building a road
between two fixed points (see Figure 1.4) at minimum total cost taking into
account, however, certain safety requirements. To compute the total cost we
take into account not just the construction costs, but also the cost of running
the vehicles on this road.

For a fixed feasible trajectory

$$
x = (x(0), x(1), \dots, x(T)),
$$

and a (dynamical) system whose characteristics are identified by a parameter
$\omega \in \Omega$, the dynamics are given by the equations, for $t = 0,\dots,T-1$ and
$\Delta x(t) = x(t+1) - x(t)$,

$$
\Delta x(t) = h[t, x(t), y(t), \omega], \tag{1.32}
$$

and

$$
x(0) = x_0, \quad x(T) = x_T.
$$

Figure 1.4 Road design problem.

Here the variable $t$ records position (between $0$ and $T$). The variables

$$
y = (y(0), y(1), \dots, y(T))
$$

are the control variables at $t = 0, 1, \dots, T$ that determine the way a dynamical
system of type $\omega$ will be controlled when following the trajectory $x$ from $0$ to
$T$. The choice of the $x$-trajectory is subject to certain restrictions, which
include safety considerations, such as

$$
|\Delta x(t)| \le d_1, \quad |\Delta x(t) - \Delta x(t-1)| \le d_2, \tag{1.33}
$$

i.e. the first two derivatives cannot exceed certain prescribed levels.

For a specific system $\omega \in \Omega$ and a fixed trajectory $x$, the optimal control
actions (recourse)

$$
y(x,\omega) = (y(0,x,\omega), y(1,x,\omega), \dots, y(T,x,\omega))
$$

are determined by minimizing the loss function

$$
g_0(x(0), y(0), \dots, x(T-1), y(T-1), x(T), \omega)
$$

subject to the system's equations (1.32) and possibly some constraints on $y$. If
$P$ is the a priori distribution of the system's parameters, the problem is to find
a trajectory (road design) $x$ that minimizes on average the loss function, i.e.

$$
F_0(x) = E\{g_0(x(0), y(0,x,\omega), \dots, x(T-1), y(T-1,x,\omega), x(T), \omega)\} \tag{1.34}
$$

subject to constraints of type (1.33).

In this problem the observation takes place in one step only. We have
amalgamated all future observations that will actually occur at different time
periods in a single collection of possible environments (events). There are
situations when $\omega$ has the structure

$$
\omega = (\omega(0), \omega(1), \dots, \omega(T))
$$

and the observations take place in $T$ steps. As an important example of such
a class, let us consider the following problem: the long term decision
$x = (x(0), x(1), \dots, x(T))$ and the corrective recourse actions
$y = (y(0), y(1), \dots, y(T))$ must satisfy the linear system of inequalities:

$$
\begin{array}{l}
A_{00}\, x(0) + B_0\, y(0) \ge h(0), \\
A_{10}\, x(0) + A_{11}\, x(1) + B_1\, y(1) \ge h(1), \\
\qquad \vdots \\
A_{T0}\, x(0) + A_{T1}\, x(1) + \cdots + A_{TT}\, x(T) + B_T\, y(T) \ge h(T), \\
x(0) \ge 0, \dots, x(T) \ge 0; \quad y(0) \ge 0, \dots, y(T) \ge 0,
\end{array}
$$

where the matrices $A_{tk}$, $B_t$ and the vectors $h(t)$ are random, i.e. depend on $\omega$.
The sequence $x = (x(0), \dots, x(T))$ must be chosen before any information about
the values of the random coefficients can be collected. At time $t = 0,\dots,T$, the
actual values of the matrices and vectors

$$
A_{tk},\ k = 0,\dots,t; \quad B_t,\ h(t),\ d(t)
$$

are revealed, and we adapt to the existing situation by choosing a corrective
action $y(t,x,\omega)$ such that

$$
y(t,x,\omega) \in \operatorname{argmin}\Big[\, d(t)\, y \ \Big|\ B_t\, y \ge h(t) - \sum_{k=0}^{t} A_{tk}\, x(k),\ y \ge 0 \,\Big].
$$

The problem is to find $x = (x(0), \dots, x(T))$ that minimizes

$$
F_0(x) = \sum_{t=0}^{T} \big[\, c(t)\, x(t) + E\{d(t)\, y(t,x,\omega)\} \,\big] \tag{1.35}
$$

subject to $x(0) \ge 0, \dots, x(T) \ge 0$.

In the functional (1.35), or (1.34), the dependence of $y(t,x,\omega)$ on $x$ is
nonlinear, thus these functions do not possess the separability properties
necessary to allow direct use of the conventional recursive equations of dynamic
programming. For problem (1.31), these equations can be derived, provided
the functions $g_i$, $i = 0,\dots,m$, have certain specific properties. There are,
however, two major obstacles to the use of such recursive equations in the
stochastic case: the tremendous increase of the dimensionality, and again, the
more serious problem created by the need to compute mathematical expectations.

For example, consider the dynamic system described by the system of
equations (1.28). Let us ignore all constraints except $x(t) \in X(t)$, for
$t = 0, 1, \dots, T$. Suppose also that

$$
\omega = (\omega(0), \omega(1), \dots, \omega(T))
$$

where $\omega(t)$ only depends on the past, i.e. is independent of $\omega(t+1), \dots, \omega(T)$.
Since the minimization of

$$
F_0(x) = E\{g_0(z(0), x(0), \dots, z(T), x(T), \omega)\}
$$

with respect to $x$ can then be written as

$$
\min_{x(0)} \min_{x(1)} \cdots \min_{x(T)} E\{g_0\},
$$

and if $g_0$ is separable, i.e. can be expressed as

$$
g_0 = \sum_{t=0}^{T-1} g_{0t}(\Delta z(t), x(t), \omega(t)) + g_{0T}(z(T), \omega(T)),
$$

then

$$
\begin{aligned}
\min_x F_0(x) = {} & \min_{x(0)} E\{g_{00}(\Delta z(0), x(0), \omega(0))\} + \min_{x(1)} E\{g_{01}(\Delta z(1), x(1), \omega(1))\} \\
& + \cdots + \min_{x(T-1)} E\{g_{0,T-1}(\Delta z(T-1), x(T-1), \omega(T-1))\} \\
& + E\{g_{0T}(z(T), \omega(T))\}.
\end{aligned}
$$

Recall that here, notwithstanding its sequential structure, the vector $\omega$ is to be
revealed in one global observation. Rewriting this in backward recursive form
yields the Bellman equations:

$$
v_t(z_t) = \min_x \big[\, E\{ g_{0t}(h(t, z_t, x, \omega(t)), x, \omega(t)) + v_{t+1}(z_t + h(t, z_t, x, \omega(t))) \} \ \big|\ x \in X(t) \,\big] \tag{1.36}
$$

for $t = 0,\dots,T-1$, and

$$
v_T(z_T) = E\{g_{0T}(z_T, \omega(T))\}, \tag{1.37}
$$

where $v_t$ is the value function (optimal loss-to-go) from time $t$ on, given state
$z_t$ at time $t$, that in turn depends on $x(0), x(1), \dots, x(t-1)$.
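
The backward recursion (1.36)-(1.37) can be sketched on a tiny example with integer states. Everything specific in the Python fragment below is an assumption chosen for illustration: the horizon `T = 3`, the control set `CONTROLS`, the two-point noise `NOISE`, the dynamics `h` and the stage and terminal costs. The expectations are finite sums here only because the noise is discrete.

```python
from functools import lru_cache

T = 3
CONTROLS = (-1, 0, 1)             # X(t), taken to be the same for every t
NOISE = ((0.5, -1), (0.5, 1))     # (probability, value) pairs for w(t)

def h(t, z, x, w):
    """Assumed dynamics: the increment Delta z(t) is control plus noise."""
    return x + w

def g_stage(dz, x, w):
    """Assumed stage cost g_0t: control effort only."""
    return abs(x)

def g_final(z, w):
    """Assumed terminal cost g_0T: squared final state."""
    return z * z

@lru_cache(maxsize=None)
def v(t, z):
    """Value function of (1.36)-(1.37), computed by memoized backward recursion."""
    if t == T:
        return sum(p * g_final(z, w) for p, w in NOISE)
    return min(
        sum(p * (g_stage(h(t, z, x, w), x, w) + v(t + 1, z + h(t, z, x, w)))
            for p, w in NOISE)
        for x in CONTROLS
    )

print(v(0, 0))
```

With a continuous distribution or a higher-dimensional state, each call to `v` would require a multidimensional integral, which is the difficulty discussed next in the text.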


To be able to utilize this recursion, reducing ultimately the problem to:

$$
\begin{array}{l}
\text{find } x \in X(0) \subset R^n \text{ such that } v_0 \text{ is minimized, where} \\
v_0 = E\{ g_{00}(h(0, z_0, x, \omega(0)), x, \omega(0)) + v_1(z_0 + h(0, z_0, x, \omega(0))) \},
\end{array}
$$

we must be able to compute the mathematical expectations

$$
E\{g_{0t}(\Delta z(t), x, \omega(t))\}
$$

as a function of the intermediate solutions $x(0), \dots, x(t-1)$ that determine
$\Delta z(t)$, and this is only possible in special cases. The main goal in the
development of solution procedures for stochastic programming problems is the
development of appropriate computational tools that precisely overcome such
difficulties.

A much more difficult situation may occur in the (full) multistage version
of the recourse model where observation of some of the environment takes place
at each stage of the decision process, at which time (taking into account the new
information collected) a new recourse action is taken. The whole process looks
like an alternating sequence: decision-observation-...-observation-decision.

Let $x$ be the decision at stage $k = 0$, which may itself be split into a
sequence $x(0), \dots, x(N)$, each $x(k)$ corresponding to that component of $x$ that
enters into play at stage $k$, similar to the dynamical version of the two-stage
model introduced earlier. Consider now a sequence

$$
y = (y(0), y(1), \dots, y(N))
$$

of recourse decisions (adaptive actions, corrections), $y(k)$ being associated
specifically to stage $k$. Let

$$
\mathcal{B}_k := \text{information set at stage } k,
$$

consisting of past measurements and observations, thus $\mathcal{B}_k \subset \mathcal{B}_{k+1}$.
The multistage recourse problem is

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{such that} & f_{0i}(x) \le 0, \quad i = 1,\dots,m_0, \\
 & E\{f_{1i}(x, y(1), \omega) \mid \mathcal{B}_1\} \le 0, \quad i = 1,\dots,m_1, \\
 & \qquad \vdots \\
 & E\{f_{Ni}(x, y(1), \dots, y(N), \omega) \mid \mathcal{B}_N\} \le 0, \quad i = 1,\dots,m_N, \\
 & y(k) \in Y(k), \quad k = 1,\dots,N, \\
\text{and} & F_0(x) \ \text{is minimized,}
\end{array} \tag{1.38}
$$

where

$$
F_0(x) = E^{\mathcal{B}_0}\Big\{ \min_{y(1)} E^{\mathcal{B}_1}\Big\{ \cdots \min_{y(N-1)} E^{\mathcal{B}_{N-1}}\{ f_0(x, y(1), \dots, y(N), \omega) \} \cdots \Big\} \Big\}.
$$

If the decision $x$ affects only the initial stage $k = 0$, we can obtain recursive
equations similar to (1.36)-(1.37), except that the expectation $E$ must be replaced
by the conditional expectations $E^{\mathcal{B}_t}$, which in no way simplifies the numerical
problem of finding a solution. In the more general case when
$x = (x(0), x(1), \dots, x(N))$, one can still write down recursion formulas but of
such (numerical) complexity that all hope of solving this class of problems by
means of these formulas must quickly be abandoned.

1.7 Solving the Deterministic Equivalent Problem

All of the preceding discussion has suggested that the problem:

$$
\begin{array}{ll}
\text{find} & x \in R^n \\
\text{such that} & F_i(x) = \int f_i(x,\omega)\, P(d\omega) \le 0, \quad i = 1,\dots,m, \\
\text{and} & z = F_0(x) = \int f_0(x,\omega)\, P(d\omega) \ \text{is minimized,}
\end{array} \tag{1.39}
$$

exhibits all the peculiarities of stochastic programs, and that for exploring
computational schemes, at least at the conceptual level, it can be used as the
canonical problem.

Sometimes it is possible to find explicit analytical expressions for an accept- 
able approximation of the F;. The randomness in problem (1.39) disappears 
and we can rely on conventional deterministic optimization methods for solving 
(1.39). Of course, such cases are highly cherished, and can be dealt with by 
relying on standard nonlinear programming techniques. 

One extreme case is when $\bar\omega = E\{\omega\}$ is a certainty equivalent for the
stochastic optimization problem, i.e. the solution to (1.39) can be found by
solving:

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{such that} & f_i(x,\bar\omega) \le 0, \quad i = 1,\dots,m, \\
\text{and} & z = f_0(x,\bar\omega) \ \text{is minimized;}
\end{array} \tag{1.40}
$$

this would be the case if the $f_i$, $i = 0,\dots,m$, are linear functions of $\omega$. In
general, as already mentioned in Section 1.3, the solution of (1.40) may have
little in common with the initial problem (1.39). But if the $f_i$ are convex
functions of $\omega$, then according to Jensen's inequality

$$
E\{f_i(x,\omega)\} \ge f_i(x,\bar\omega), \quad i = 1,\dots,m.
$$

This means that the set of feasible solutions in (1.40) is larger than in (1.39)
and hence the solution of (1.40) could provide a lower bound for the solution
of the original problem.
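
The Jensen lower bound is easy to check numerically on a small example. In the Python sketch below, the three-point distribution, the integrand `f0` (convex in `w`) and the decision grid are assumptions made for the illustration; there are no expectation constraints, so only the objective is compared.

```python
# Hypothetical discrete distribution: (probability, omega) pairs.
scenarios = [(0.2, 1.0), (0.5, 2.0), (0.3, 5.0)]

def f0(x, w):
    """Assumed integrand, convex in w: shortage max(w - x, 0) plus a linear cost."""
    return max(w - x, 0.0) + 0.25 * x

def F0(x):
    """Exact expectation E{f0(x, w)} for the discrete distribution."""
    return sum(p * f0(x, w) for p, w in scenarios)

w_bar = sum(p * w for p, w in scenarios)     # mean of omega
grid = [k * 0.1 for k in range(0, 61)]       # candidate decisions in [0, 6]

lower = min(f0(x, w_bar) for x in grid)      # optimal value of the problem at w_bar
exact = min(F0(x) for x in grid)             # optimal value of the stochastic problem
print(lower, exact)
```

Because `f0(x, w_bar) <= F0(x)` holds for every `x`, the minimum of the mean-value problem cannot exceed the minimum of the stochastic problem.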

Another case is a stochastic optimization problem with simple probabilistic
constraints. Suppose the constraints of (1.39) are of the type

$$
P\Big\{\omega \ \Big|\ \sum_{j=1}^{n} t_{ij}\, x_j \ge h_i(\omega) \Big\} \ge \alpha_i, \quad i = 1,\dots,m, \tag{1.41}
$$

with deterministic coefficients $t_{ij}$ and random right-hand sides $h_i(\cdot)$. Then
these constraints are equivalent to the linear system

$$
\sum_{j=1}^{n} t_{ij}\, x_j \ge h_i^{\alpha}, \quad i = 1,\dots,m,
$$

where

$$
h_i^{\alpha} = \inf\{\, t \mid P[\omega \mid h_i(\omega) \le t] \ge \alpha_i \,\}.
$$
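
For a right-hand side with finitely many outcomes, the quantile that defines the deterministic equivalent can be computed by accumulating probabilities in increasing order of the outcomes. The function name `alpha_quantile`, the small tolerance and the three-point distribution below are assumptions introduced for this sketch.

```python
def alpha_quantile(outcomes, alpha):
    """Return inf{t : P[h <= t] >= alpha} for a discrete distribution.

    outcomes is a list of (probability, value) pairs.
    """
    cumulative = 0.0
    for p, value in sorted(outcomes, key=lambda pv: pv[1]):
        cumulative += p
        # Small tolerance guards against rounding in the running sum.
        if cumulative >= alpha - 1e-12:
            return value
    raise ValueError("probabilities sum to less than alpha")

# Hypothetical distribution of h_i.
outcomes = [(0.5, 10.0), (0.25, 12.0), (0.25, 15.0)]
print(alpha_quantile(outcomes, 0.75), alpha_quantile(outcomes, 0.9))
```

The chance constraint with level 0.75 then reduces to the linear inequality with right-hand side 12, and with level 0.9 to the one with right-hand side 15.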
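If all the parameters $t_{ij}$ and $h_i$ in (1.41) are jointly normally distributed
(and $\alpha_i \ge .5$), then, with $x_0 = -1$, the constraints

$$
\sum_{j=0}^{n} \bar t_{ij}\, x_j - \beta_i \Big( \sum_{j=0}^{n} \sum_{k=0}^{n} r_{ijk}\, x_j x_k \Big)^{1/2} \ge 0, \quad i = 1,\dots,m,
$$

can be substituted for (1.41), where

$$
\begin{aligned}
t_{i0}(\cdot) &= h_i(\cdot), \\
\bar t_{ij} &= E\{t_{ij}(\omega)\}, \quad j = 0, 1, \dots, n, \\
r_{ijk} &= \operatorname{cov}(t_{ij}(\cdot), t_{ik}(\cdot)), \quad j = 0,\dots,n;\ k = 0,\dots,n,
\end{aligned}
$$

and $\beta_i$ is a coefficient that identifies the $\alpha_i$-fractile of the normalized
normal distribution. Indeed, $\sum_{j=0}^{n} t_{ij}(\omega)\, x_j$ is then normal with mean
$\sum_j \bar t_{ij} x_j$ and variance $\sum_j \sum_k r_{ijk} x_j x_k$, and (1.41) requires it to be
nonnegative with probability at least $\alpha_i$.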
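
The normal case can be checked with the standard library. The sketch below tests a single chance constraint of type (1.41) using the fact that a normal variable with mean `mu` and standard deviation `sigma` is nonnegative with probability at least `alpha` exactly when `mu - beta * sigma >= 0`, where `beta` is the `alpha`-quantile of the standard normal distribution. The function name, the argument layout (index 0 of the covariance matrix refers to the right-hand side `h`) and the data are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

def chance_constraint_ok(x, t_mean, h_mean, cov, alpha):
    """Check P{sum_j t_j x_j - h >= 0} >= alpha for jointly normal (h, t_1..t_n).

    cov is the covariance matrix of (h, t_1, ..., t_n), with h at index 0.
    """
    coeff = [-1.0] + list(x)                 # weights of (h, t_1, ..., t_n)
    means = [h_mean] + list(t_mean)
    mu = sum(c * m for c, m in zip(coeff, means))
    size = len(coeff)
    var = sum(coeff[j] * cov[j][k] * coeff[k]
              for j in range(size) for k in range(size))
    beta = NormalDist().inv_cdf(alpha)       # alpha-quantile of N(0, 1)
    return mu - beta * sqrt(var) >= 0.0

# Hypothetical data: deterministic t = 5, random h with variance 1.
cov = [[1.0, 0.0], [0.0, 0.0]]
print(chance_constraint_ok([1.0], [5.0], 2.0, cov, 0.95))
print(chance_constraint_ok([1.0], [5.0], 4.0, cov, 0.95))
```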

Another important class are those problems classified as stochastic programs
with simple recourse (see Chapter 4), or more generally recourse problems
where the random coefficients have a discrete distribution with a relatively
small number of density points (support points), as discussed in Chapter 3. For
the linear model (1.22) introduced in Section 1.5, where

$$
\Omega = \{ (q^1, W^1, h^1, T^1), \dots, (q^N, W^N, h^N, T^N) \}
$$

and where for $k = 1,\dots,N$ the point $(q^k, W^k, h^k, T^k)$ is assigned probability
$p_k$, one can find the solution of (1.22) by solving:

$$
\begin{array}{ll}
\text{find} & x \in R^n_+,\ (y^k \in R^{n'}_+,\ k = 1,\dots,N) \\
\text{such that} & Ax \ge b, \\
 & T^1 x + W^1 y^1 = h^1, \\
 & T^2 x + W^2 y^2 = h^2, \\
 & \qquad \vdots \\
 & T^N x + W^N y^N = h^N, \\
 & cx + p_1 q^1 y^1 + p_2 q^2 y^2 + \cdots + p_N q^N y^N = z, \\
\text{and} & z \ \text{is minimized.}
\end{array} \tag{1.42}
$$

This problem has a (dual) block-angular structure. It should be noticed that
the number $N$ could be astronomically large: if only the vector $h$ is random and
each component of the vector

$$
h = (h_1, h_2, \dots, h_{m'})
$$

has two independent outcomes, then $N = 2^{m'}$. A direct attempt at solving
(1.42) by conventional linear programming techniques will only yield at each
iteration very small progress in terms of the $x$ variables. Therefore, a special
large scale optimization technique is needed for solving even this relatively
simple stochastic programming problem.
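
The growth of the number of scenarios is easy to reproduce. The following Python sketch enumerates the joint outcomes of independent components, each with two outcomes, and confirms that their number doubles with every additional component; the outcome values and probabilities are placeholders.

```python
from itertools import product

def joint_scenarios(components):
    """Yield (probability, outcome vector) for independent discrete components.

    components is a list with one list of (probability, value) pairs per component.
    """
    for combo in product(*components):
        prob = 1.0
        for p, _ in combo:
            prob *= p
        yield prob, tuple(value for _, value in combo)

# Hypothetical right-hand side with 3 components, two outcomes each.
components = [[(0.5, 0.0), (0.5, 1.0)] for _ in range(3)]
scenarios = list(joint_scenarios(components))
print(len(scenarios))            # one block of constraints in (1.42) per scenario

# With 30 such components the extensive form would need 2**30 blocks.
print(2 ** 30)
```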

1.8 Approximation Schemes


If a problem is too difficult to solve one may have to learn to live with approxi- 
mate solutions. The question however, is to be able to recognize an approximate 
solution if one is around, and also to be able to assess how far away from an 
optimal solution one still might be. For this one needs a convergence theory 
complemented by (easily computable) error bounds, improvement schemes, etc. 
This is an area of very active research in stochastic optimization, both at the 
theoretical and the software-implementation level. These questions are studied 
in much more detail in Chapter 2, here we only want to highlight some of the 
questions that need to be raised and the main strategies available in the design 
of approximation schemes. 

For purposes of discussion it will be useful to consider a simplified version
of (1.39):

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{that minimizes} & F_0(x) = \int f_0(x,\omega)\, P(d\omega);
\end{array} \tag{1.43}
$$

we suppose that the other constraints have been incorporated in the definition
of the set $X$. We deal with a problem involving one expectation functional.
Whatever applies to this case also applies to the more general situation (1.39),
making the appropriate adjustments to take into account the fact that the
functions

$$
F_i(x) = \int f_i(x,\omega)\, P(d\omega), \quad i = 1,\dots,m,
$$

determine constraints.

Given a problem of type (1.43) that does not fall in one of the nice
categories mentioned in Section 1.7, one solution strategy may be to replace it by
an approximation*. There are two possibilities to simplify the integration that
appears in the objective function: replace $f_0$ by an integrand $f_0^\nu$, or replace
$P$ by an approximation $P_\nu$; of course, one could approximate both quantities
at once.

* Another approach will be discussed in Section 1.9.

The possibility of finding an acceptable approximate of $f_0$ that renders the
calculation of

$$
\int f_0^\nu(x,\omega)\, P(d\omega) =: F_0^\nu(x)
$$

sufficiently simple so that it can be carried out analytically or numerically at
low cost is very much problem dependent. Typically one should search for a
separable function of the type

$$
f_0^\nu(x,\omega) = \sum_{j=1}^{q} \varphi_j(x, \omega_j),
$$

recall that $\Omega \subset R^q$, so that

$$
F_0^\nu(x) = \sum_{j=1}^{q} \int \varphi_j(x,\omega_j)\, P(d\omega) = \sum_{j=1}^{q} \int \varphi_j(x,\omega_j)\, P_j(d\omega_j),
$$

where the $P_j$ are the marginal measures associated to the $j$-th component of
$\omega$. The multiple integral is then approximated by the sum of 1-dimensional
integrals for which a well-developed calculus is available (as well as excellent
quadrature subroutines). Let us observe that we do not necessarily have to find
approximates that lead to 1-dimensional integrals; it would be acceptable to
end up with 2-dimensional integrals, and even in some cases, when $P$ is of
certain specific types, with 3-dimensional integrals. In any case, this would
mean that the structure of $f_0$ is such that the interactions between the various
components of $\omega$ play only a very limited role in determining the cost associated
to a pair $(x,\omega)$. Otherwise an approximation of this type could very well throw
us very far off base. We shall not pursue this question any further since it is
best handled on a problem by problem basis. If $\{f_0^\nu,\ \nu = 1,\dots\}$ is a sequence of
such functions converging, in some sense, to $f_0$, we would want to know if the
solutions of

$$
x^\nu \in \operatorname*{argmin}_{X} F_0^\nu = \int f_0^\nu(x,\omega)\, P(d\omega), \quad \nu = 1,\dots
$$

converge to the optimal solution of (1.43) and if so, at what rate. These
questions would be handled very much in the same way as when approximating the
probability measure, as will be discussed next.

Finding valid approximates for $f_0$ is only possible in a limited number of
cases, while approximating $P$ is always possible in the following sense. Suppose
$P_\nu$ is a probability measure (that approximates $P$), then

$$
|F_0^\nu(x) - F_0(x)| \le \int |f_0(x,\omega)|\, |P_\nu - P|(d\omega), \tag{1.44}
$$

where now

$$
F_0^\nu(x) = \int f_0(x,\omega)\, P_\nu(d\omega).
$$

Thus if $f_0$ has Lipschitz properties, for example, then by choosing $P_\nu$
sufficiently close to $P$ we can guarantee a maximal error bound when replacing
(1.43) by:

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{that minimizes} & F_0^\nu(x) = \int f_0(x,\omega)\, P_\nu(d\omega).
\end{array} \tag{1.45}
$$

Since it is the multidimensional integration with respect to $P$ that was the
source of the main difficulties, the natural choice for $P_\nu$ (although in a few
concrete cases there are other possibilities) is a discrete distribution that
assigns to a finite number of points

$$
\omega^1, \omega^2, \dots, \omega^L
$$

the probabilities

$$
p_1, p_2, \dots, p_L.
$$

Problem (1.45) then becomes:

$$
\begin{array}{ll}
\text{find} & x \in X \subset R^n \\
\text{that minimizes} & F_0^\nu(x) = \sum_{l=1}^{L} p_l\, f_0(x, \omega^l).
\end{array} \tag{1.46}
$$

At first glance it may now appear that the optimization problem can be solved
by any standard nonlinear programming technique, the sum $\sum_{l=1}^{L}$ involving only
a “finite” number of terms, the only question being how “approximate” the
solution of (1.46) is. However, if inequality (1.44) is used to design this
approximation, then to obtain a relatively sharp bound from (1.44) the number $L$
of discrete points required may be so large that problem (1.46) is in no way any
easier than our original problem (1.43). To fix the ideas, if $\Omega \subset R^{10}$ and $P$
is a continuous distribution, a good approximation, as guaranteed by (1.44), may
require having $10^{10} \le L \le 10^{11}$! This is jumping from the fire into the
frying pan.

This clearly indicates a need for more sophisticated approximation schemes.
As background, we have the following convergence results. Suppose
$\{P_\nu,\ \nu = 1,\dots\}$ is a sequence of probability measures that converge in
distribution to $P$, and suppose that for all $x \in X$ the function $f_0(x,\omega)$ is
uniformly integrable with respect to all $P_\nu$, and suppose there exists a bounded
set $D$ such that

$$
D \cap \operatorname{argmin}\Big[\, F_0^\nu(x) = \int f_0(x,\omega)\, P_\nu(d\omega) \ \Big|\ x \in X \,\Big] \ne \emptyset
$$

for almost all $\nu$, then

$$
\inf_X F_0 = \lim_{\nu \to \infty} \big( \inf_X F_0^\nu \big)
$$

and

$$
\text{if } x^\nu \in \operatorname*{argmin}_X F_0^\nu,\quad x = \lim_{k \to \infty} x^{\nu_k},
$$

then

$$
x \in \operatorname*{argmin}_X F_0.
$$

The convergence result indicates that we are given a wide latitude in the choice
of the approximating measures; the only real concern is to guarantee the
convergence in distribution of the $P_\nu$ to $P$, the uniform integrability condition
being from a practical viewpoint a pure technicality.

However, such a result does not provide us with error bounds. But since
we can choose the $P_\nu$ in such a wide variety of ways, we could for example have
$P_\nu$ such that

$$
\inf_X F_0^\nu \le \inf_X F_0 \tag{1.47}
$$

and $P_{\nu+1}$ such that

$$
\inf_X F_0 \le \inf_X F_0^{\nu+1}, \tag{1.48}
$$

providing us with upper and lower bounds for the infimum and consequently
error bounds for the approximate solutions:

$$
x^\nu \in \operatorname*{argmin}_X F_0^\nu \quad \text{and} \quad x^{\nu+1} \in \operatorname*{argmin}_X F_0^{\nu+1}.
$$

This, combined with a sequential procedure for redesigning the approximations
$P_\nu$ so as to improve the error bounds, is very attractive from a computational
viewpoint since we may be able to get away with discrete measures that involve
only a relatively small number of points (and this seems to be confirmed by
computational experience).

The only question now is how to find these measures that guarantee (1.47)
and (1.48). There are basically two approaches: the first one exploits the
properties of the function $\omega \mapsto f_0(x,\omega)$ so as to obtain inequalities when taking
expectations, and the second one chooses $P_\nu$ in a class of probability measures
that have characteristics similar to $P$ but so that $P_\nu$ dominates or is dominated
by $P$ and consequently yields the desired inequality (1.47) or (1.48). A typical
example of this latter case is to choose $P_\nu$ so that it majorizes or is majorized
by $P$; another one is to choose $P_\nu$ so that for at least some $\hat x \in X$:

$$
P_\nu \in \operatorname{argmax}\Big[\, \int f_0(\hat x,\omega)\, Q(d\omega) \ \Big|\ Q \in \mathcal{D} \,\Big] \tag{1.49}
$$

where $\mathcal{D}$ is a class of probability measures on $\Omega$ that contains $P$, for example

$$
\mathcal{D} = \Big\{\, Q \ \Big|\ \int \omega\, Q(d\omega) = \bar\omega \,\Big\}.
$$

Then

$$
F_0^\nu(\hat x) = \int f_0(\hat x,\omega)\, P_\nu(d\omega) \ge F_0(\hat x)
$$

yields an upper bound. If instead of $P_\nu$ in the argmax we take $P_\nu$ in the argmin,
we obtain a lower bound.

If $\omega \mapsto f_0(x,\omega)$ is convex (concave) or at least locally convex (locally
concave) in the area of interest, we may be able to use Jensen's inequality to
construct probability measures that yield lower (upper) approximates for $F_0$, and
probability measures concentrated on extreme points to obtain upper (lower)
approximates of $F_0$. We have already seen such an example in Section 1.7 in
connection with problem (1.40), where $P$ is replaced by $P_\nu$ that concentrates all
the probability mass on $\bar\omega = E\{\omega\}$.
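
For a scalar random variable on an interval and an integrand convex in $\omega$, the two bounding measures mentioned above can be written down explicitly: the lower bound concentrates all mass at the mean, and the upper bound places mass on the two end points of the interval so as to preserve the mean (this two-point construction is commonly called the Edmundson-Madansky bound). The distribution, the integrand and the function names in the Python sketch below are assumptions made for the illustration.

```python
def jensen_lower(f, mean):
    """Lower bound on E{f(w)} for convex f: all mass at the mean."""
    return f(mean)

def extreme_point_upper(f, a, b, mean):
    """Upper bound on E{f(w)} for convex f on [a, b]: mass on the end points,
    with weights chosen so that the two-point measure has the same mean."""
    weight_a = (b - mean) / (b - a)
    return weight_a * f(a) + (1.0 - weight_a) * f(b)

# Hypothetical distribution of w on [0, 4] and a convex integrand for fixed x = 1.
scenarios = [(0.25, 0.0), (0.5, 2.0), (0.25, 4.0)]
f = lambda w: max(w - 1.0, 0.0)

mean = sum(p * w for p, w in scenarios)
exact = sum(p * f(w) for p, w in scenarios)
lower = jensen_lower(f, mean)
upper = extreme_point_upper(f, 0.0, 4.0, mean)
print(lower, exact, upper)
```

Refining the discretization, for instance by applying the same two bounds on each cell of a partition of the interval, would tighten the gap between `lower` and `upper`; the sketch does not implement that step.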

Given an approximate measure $P_\nu$, we also need a scheme to refine it
so that the error bounds can be improved, if necessary. One cannot hope to
have a universal scheme since so much will depend on the problem at hand as
well as the discretizations that have been used to build the upper and lower
bounding problems. There is, however, one general rule that seems to work
well, in fact surprisingly well, in practice: choose the region of refinement of
the discretization in such a way as to capture as much of the nonlinearity of
$f_0(x,\cdot)$ as possible.

It is, of course, not necessary to wait until the optimal solution of an
approximate problem has been reached to refine the discretization of the
probability measure. Conceivably, and ideally, the iterations of the solution
procedure should be intermixed with the sequential procedure for refining the
approximations. Common sense dictates that as we approach the optimal solution
we should seek better and better estimates of the function values and gradients.
How many iterations should one perform before a refinement of the approxima- 
tion is introduced, or which tell-tale sign should trigger a further refinement, 
are questions that have only been scantily investigated, but are ripe for study 
at least for certain specific classes of stochastic optimization problems. 

As to the rate of convergence this is a totally open question, in general 
and in particular, except on an experimental basis where the results have been 
much better than what could be expected from the theory. One open challenge 
is to develop the theory that validates the convergence behavior observed in 
practice. 


1.9 Stochastic Procedures 
Let us again consider the general formulation (1.10) for stochastic programs: 


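$$\begin{aligned}
&\text{find } x \in X \subset R^n \\
&\text{such that } F_i(x) = \int f_i(x,\omega)\,P(d\omega) \le 0, \quad i = 1,\dots,m, \\
&\text{and } F_0(x) = \int f_0(x,\omega)\,P(d\omega) \text{ is minimized.}
\end{aligned} \tag{1.50}$$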


We already know from the discussion in Sections 1.3 and 1.7 that the exact
evaluation of the integrals is only possible in exceptional cases, for special types
of probability measures $P$ and integrands $f_i$. The rule in practice is that it
is only possible to calculate random observations $f_i(x,\omega)$ of $F_i(x)$. Therefore
in the design of universal solution procedures we should rely on no more than
the random observations $f_i(x,\omega)$. Under these premises, finding the solution of
(1.50) is a difficult problem at the border between mathematical statistics and
optimization theory. For instance, even the calculation of the values $F_i(\bar{x})$,
$i = 0,\dots,m$, for a fixed $\bar{x}$ requires statistical estimation procedures: on the basis
of the observations

$$f_i(\bar{x},\omega^0),\ f_i(\bar{x},\omega^1),\ \dots,\ f_i(\bar{x},\omega^s),\ \dots$$

one has to estimate the mean value

$$E\{ f_i(\bar{x},\omega) \}.$$

The answer to the simplest question, whether or not a given $\bar{x} \in X$ is feasible,
requires verifying the statistical hypothesis that

$$E\{ f_i(\bar{x},\omega) \} \le 0, \quad \text{for } i = 1,\dots,m.$$

Since we can only rely on random observations, it seems quite natural to think
of stochastic solution procedures that do not make use of the exact values of
the $F_i(x)$, $i = 0,\dots,m$. Of course, we cannot guarantee in such a situation
a monotonic decrease (or increase) of the objective value as we move from one
iterate to the next, thus these methods must, by the nature of things, be
non-monotonic.
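
The estimation and hypothesis-testing steps mentioned above can be sketched as follows. This is a minimal illustration, not a procedure prescribed by the text: the constraint function `observe_f1`, the normal distribution of $\omega$, the sample size and the one-sided normal test with critical value 1.645 are all assumptions made for the example.

```python
import math
import random
import statistics


def estimate_mean(observe, x, n, rng):
    """Sample mean and standard error of n random observations f_i(x, omega)."""
    samples = [observe(x, rng) for _ in range(n)]
    mean = statistics.fmean(samples)
    std_error = statistics.stdev(samples) / math.sqrt(n)
    return mean, std_error


def feasibility_not_rejected(mean, std_error, z=1.645):
    """One-sided test of the hypothesis E{f_i} <= 0: the hypothesis is
    rejected only if the sample mean is significantly above zero."""
    return mean - z * std_error <= 0.0


def observe_f1(x, rng):
    """Hypothetical constraint f_1(x, omega) = x - omega, omega ~ N(1, 1),
    so that F_1(x) = x - 1 and x is feasible exactly when x <= 1."""
    return x - rng.gauss(1.0, 1.0)


rng = random.Random(0)
mean_a, se_a = estimate_mean(observe_f1, 0.5, 2000, rng)   # truly feasible
mean_b, se_b = estimate_mean(observe_f1, 1.5, 2000, rng)   # truly infeasible
ok_a = feasibility_not_rejected(mean_a, se_a)
ok_b = feasibility_not_rejected(mean_b, se_b)
```

Even this simplest question can only be answered up to a statistical error, which is the reason the methods that follow work directly with the random observations.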

Deterministic processes are special cases of stochastic processes, thus
stochastic optimization gives us an opportunity to build more flexible and
effective solution methods for problems that cannot be solved within the standard
framework of deterministic optimization techniques. Stochastic quasi-gradient
methods are a class of procedures of that type. They are described in more detail
in Chapter 6; here we shall only sketch out their major features. We consider
two examples in order to get a better grasp of the main ideas involved.

Example 1: Optimization by simulation. Let us imagine that the problem
is so complicated that a computer based simulation model has been designed
in order to indicate how the future might unfold in time for each choice of a
decision $x$. Suppose that the stochastic elements have been incorporated in
the simulation so that for a single choice $x$ repeated simulation runs result in
different outputs. We can always identify a simulation run as the observation
of an event (environment) $\omega$ from a sample space $\Omega$. To simplify matters, let
us assume that only a single quantity

$$f_0(x, \omega)$$

summarizes the output of the simulation run $\omega$ for given $x$. The problem is to

$$\begin{aligned}
&\text{find } x \in R^n \\
&\text{that minimizes } F_0(x) = E\{ f_0(x,\omega) \}.
\end{aligned} \tag{1.51}$$

Let us also assume that $F_0$ is differentiable. Since we do not know with any
level of accuracy the values or the gradients of $F_0$ at $x$, we cannot apply the
standard gradient method, that generates iterates through the recursion:

$$x^{s+1} := x^s - \rho_s \sum_{j=1}^{n} \frac{F_0(x^s + \Delta_s e^j) - F_0(x^s)}{\Delta_s}\, e^j, \tag{1.52}$$

where $\rho_s$ is the step-size, $\Delta_s$ determines the mesh for the finite difference
approximation to the gradient, and $e^j$ is the unit vector on the $j$-th axis. A
well-known procedure to deal with the minimization of functions in this setting
is the so-called stochastic approximation method that can be viewed as
a recursive Monte-Carlo optimization method. The iterates are determined as
follows:

$$x^{s+1} := x^s - \rho_s \sum_{j=1}^{n} \frac{f_0(x^s + \Delta_s e^j, \omega^{sj}) - f_0(x^s, \omega^{s0})}{\Delta_s}\, e^j, \tag{1.53}$$

where $\omega^{s0}, \omega^{s1}, \dots, \omega^{sn}$ are observations, not necessarily mutually independent;
one possibility is $\omega^{s0} = \omega^{s1} = \dots = \omega^{sn}$. The sequence $\{x^s, s = 0,1,\dots\}$
generated by the recursion (1.53) converges with probability 1 to the optimal
solution provided, roughly speaking, that the scalars $\{\rho_s, \Delta_s;\ s = 1,\dots\}$ are
chosen so as to satisfy

$$\rho_s \ge 0, \quad \sum_s \rho_s = \infty, \quad \sum_s (\rho_s^2 + \rho_s \Delta_s) < \infty,$$

($\rho_s = \Delta_s = 1/s$ are such sequences), the function $F_0$ has bounded second
derivatives and for all $x \in R^n$,

$$E\{ \|\Delta f_0(x,\omega)\|^2 \} \le d\,(1 + \|x\|^2), \quad d > 0. \tag{1.54}$$

This last condition is quite restrictive; it excludes polynomial functions $f_0(\cdot,\omega)$
of order greater than 3. Therefore, the methods that we shall consider next will
avoid making such a requirement, at least on all of $R^n$.
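
A minimal sketch of the recursion (1.53) is given below, using the admissible choice $\rho_s = \Delta_s = 1/s$ and the common-observation variant $\omega^{s0} = \dots = \omega^{sn}$. The test function `f0`, the sampler and all names are hypothetical choices made for illustration only; here $f_0(x,\omega) = \sum_j (x_j - \omega_j)^2$ with $\omega_j$ normally distributed around an assumed `center`, so that $F_0$ is minimized at `center`.

```python
import random


def stochastic_approximation(f0, sample, x0, iterations, rng):
    """Iterates of (1.53) with rho_s = Delta_s = 1/s and one common
    observation omega per iteration."""
    x = list(x0)
    n = len(x)
    for s in range(1, iterations + 1):
        rho = delta = 1.0 / s
        omega = sample(rng)                  # omega^{s0} = ... = omega^{sn}
        base = f0(x, omega)
        direction = []
        for j in range(n):
            shifted = x[:]
            shifted[j] += delta              # x^s + Delta_s e^j
            direction.append((f0(shifted, omega) - base) / delta)
        x = [xj - rho * dj for xj, dj in zip(x, direction)]
    return x


center = [1.0, -2.0]                         # assumed minimizer of F_0
f0 = lambda x, w: sum((xj - wj) ** 2 for xj, wj in zip(x, w))
sample = lambda rng: [rng.gauss(c, 1.0) for c in center]

x_sa = stochastic_approximation(f0, sample, [5.0, 5.0], 20000, random.Random(1))
```

Each iteration uses $n + 1$ evaluations of $f_0$; no value or gradient of $F_0$ itself is ever computed.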

Example 2: Optimization by random search. Let us consider the minimization
of a convex function $F_0$ with bounded second derivatives and $n$ a relatively
large number of variables. Then the calculation of the exact gradient $\nabla F_0$
at $x$ requires calling up a large number of times the subroutines for computing
all the partial derivatives and this might be quite expensive. The finite difference
approximation of the gradient in (1.52) requires $(n+1)$ function evaluations
per iteration and this also might be time-consuming if function evaluations are
difficult. Let us consider the following random search method: at each iteration
$s = 0,1,\dots$, choose a direction $h^s$ at random, see Figure 1.5.

If $F_0$ is differentiable, this direction $h^s$ or its opposite $-h^s$ leads into the
region

$$\{ x \mid F_0(x) < F_0(x^s) \}$$

of lower values for $F_0$, unless $x^s$ is already the point at which $F_0$ is minimized.
This simple idea is at the basis of the following random search procedure:


$$x^{s+1} := x^s - \rho_s\, \frac{F_0(x^s + \Delta_s h^s) - F_0(x^s)}{\Delta_s}\, h^s, \tag{1.55}$$

which requires only two function evaluations per iteration. Numerical
experimentation shows that the number of function evaluations needed to reach a
good approximation of the optimal solution is substantially lower if we use
(1.55) in place of (1.52). The vectors $h^0, h^1, \dots, h^s, \dots$ often are taken to be
independent samples of the random vector $h(\omega)$ whose components are independent
random variables uniformly distributed on $[-1,+1]$.

Figure 1.5 Random search directions $\pm h^s$.
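
The random search recursion (1.55) can be sketched in a few lines. The step-size rule $\rho_s = a/(s + s_0)$, the mesh $\Delta_s = 1/(s + s_0)$, the constants `a` and `s0`, and the quadratic test function are assumptions chosen for illustration; the text does not prescribe them.

```python
import random


def random_search(F0, x0, iterations, rng, a=3.0, s0=20):
    """Iterates of (1.55): one random direction and two evaluations of F0
    per iteration, whatever the dimension n."""
    x = list(x0)
    for s in range(iterations):
        rho = a / (s + s0)                   # assumed step-size rule
        delta = 1.0 / (s + s0)               # assumed mesh rule
        h = [rng.uniform(-1.0, 1.0) for _ in x]
        trial = [xj + delta * hj for xj, hj in zip(x, h)]
        slope = (F0(trial) - F0(x)) / delta  # difference quotient along h
        x = [xj - rho * slope * hj for xj, hj in zip(x, h)]
    return x


target = [1.0, 2.0, 3.0, 4.0, 5.0]           # assumed minimizer
F0 = lambda x: sum((xj - tj) ** 2 for xj, tj in zip(x, target))

x_rs = random_search(F0, [0.0] * 5, 5000, random.Random(2))
```

In this example the coordinate-wise scheme (1.52) would need six evaluations of $F_0$ per iteration, against two here.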

Convergence conditions for the random search method (1.55) are the same,
up to some details, as those for the stochastic approximation method (1.53).
They both have the following feature: the directions of movement from each
$x^s$, $s = 0,1,\dots$, are statistical estimates of the gradient $\nabla F_0(x^s)$. If we rewrite
the expressions (1.53) and (1.55) as

$$x^{s+1} := x^s - \rho_s \xi^s, \quad s = 0,1,\dots \tag{1.56}$$

where $\xi^s$ is the direction of movement, then in both cases

$$E\{\xi^s \mid x^s\} = \nabla F_0(x^s) + O(\Delta_s). \tag{1.57}$$


A general scheme of type (1.56) that would satisfy (1.57) combines the ideas of 
both methods. There may, of course, be many other procedures that fit into 
this general scheme. For example consider the following iterative method: 


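$$x^{s+1} := x^s - \rho_s\, \frac{f_0(x^s + \Delta_s h^s, \omega^{s1}) - f_0(x^s, \omega^{s0})}{\Delta_s}\, h^s, \tag{1.58}$$

which requires only two observations per iteration, in contrast to (1.53) that
requires $(n+1)$ observations. The vector

$$\xi^s = \frac{f_0(x^s + \Delta_s h^s, \omega^{s1}) - f_0(x^s, \omega^{s0})}{\Delta_s}\, h^s$$

also satisfies the condition (1.57), up to a positive constant factor that can be
absorbed into the step-size: when the components of $h^s$ are independent and
uniformly distributed on $[-1,+1]$,

$$E\{\xi^s \mid x^s\} = E\Big\{ \frac{F_0(x^s + \Delta_s h^s) - F_0(x^s)}{\Delta_s}\, h^s \Big\} = E\{ (\nabla F_0(x^s) \cdot h^s)\, h^s \} + O(\Delta_s) = \tfrac{1}{3} \nabla F_0(x^s) + O(\Delta_s).$$
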
The convergence of all these particular procedures (1.53), (1.55), (1.58) follows
from the convergence of the general scheme (1.56)-(1.57). These questions
are studied in detail in Chapter 6. The vector $\xi^s$ satisfying (1.57) is called a
stochastic quasi-gradient of $F_0$ at $x^s$, and the scheme (1.56)-(1.57) is an example
of a stochastic quasi-gradient procedure.

Unfortunately this procedure cannot be applied, as such, to finding the
solution of the stochastic optimization problem (1.50) since we are dealing with
a constrained optimization problem, and the functions $F_i$, $i = 0,\dots,m$, are in
general nondifferentiable. So, let us consider a simple generalization of this
procedure for solving the constrained optimization problem with nondifferentiable
objective:

$$\begin{aligned}
&\text{find } x \in X \subset R^n \\
&\text{that minimizes } F_0(x)
\end{aligned} \tag{1.59}$$

where $X$ is a closed convex set and $F_0$ is a real-valued (continuous) convex
function. The new algorithm generates a sequence $x^0, x^1, \dots, x^s, \dots$ of points in $X$
by the recursion:

$$x^{s+1} := \operatorname{prj}_X [x^s - \rho_s \xi^s] \tag{1.60}$$

where $\operatorname{prj}_X$ means projection on $X$, and $\xi^s$ satisfies

$$E\{\xi^s \mid x^0, x^1, \dots, x^s\} \in \partial F_0(x^s) + \gamma^s \tag{1.61}$$

with

$$\partial F_0(x^s) := \text{the set of subgradients of } F_0 \text{ at } x^s,$$

and $\gamma^s$ is a vector, that may depend on $(x^0, \dots, x^s)$, that goes to 0 (in a certain
sense) as $s$ goes to $\infty$. The sequence $\{x^s, s = 0,1,\dots\}$ converges with
probability 1 to an optimal solution, when the following conditions are satisfied with
probability 1:

$$\rho_s \ge 0, \quad \sum_s \rho_s = \infty, \quad \sum_s E\{ \rho_s \|\gamma^s\| + \rho_s^2 \} < \infty,$$

and

$$E\{ \|\xi^s\|^2 \mid x^0, \dots, x^s \} \text{ is bounded whenever } \{x^0, \dots, x^s\} \text{ is bounded.}$$


Convergence of this method, as well as its implementation, and different gen- 
eralizations are considered in Chapter 6. 
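
The projected recursion (1.60) is sketched below for a case where the projection is trivial to compute. Everything specific in the example is an assumption: $X$ is a box, the objective is the nondifferentiable convex function $F_0(x) = E \sum_j |x_j - \omega_j|$ with $\omega_j$ normally distributed around `center`, the sampled subgradient is the vector of signs, and $\rho_s = 1/s$.

```python
import random


def project_box(x, lower, upper):
    """Projection on the box X = {x : lower <= x <= upper}."""
    return [min(max(xj, lo), hi) for xj, lo, hi in zip(x, lower, upper)]


def projected_sqg(subgradient, sample, x0, lower, upper, iterations, rng):
    """Iterates of (1.60) with rho_s = 1/s and one sampled subgradient
    of f_0(., omega) per iteration."""
    x = project_box(x0, lower, upper)
    for s in range(1, iterations + 1):
        rho = 1.0 / s
        xi = subgradient(x, sample(rng))
        x = project_box([xj - rho * gj for xj, gj in zip(x, xi)], lower, upper)
    return x


center = [0.5, 2.0]                          # assumed medians of omega
sample = lambda rng: [rng.gauss(c, 1.0) for c in center]
# a subgradient of sum_j |x_j - omega_j| with respect to x
subgradient = lambda x, w: [(xj > wj) - (xj < wj) for xj, wj in zip(x, w)]

x_box = projected_sqg(subgradient, sample, [0.0, 0.0],
                      [0.0, 0.0], [1.0, 1.0], 20000, random.Random(4))
```

The unconstrained minimizer is `center`; its second coordinate lies outside the box, so the iterates settle near $(0.5, 1.0)$, with the second coordinate held at the boundary by the projection.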

To conclude let us suggest how the method could be implemented to solve
the linear recourse problem (1.22). From the duality theory for linear
programming, and the definition (5.2) of $Q$, one can show that

$$\partial Q(x,\omega) = \{ -uT(\omega) \mid u \in \operatorname{argmax}[\, v(h(\omega) - T(\omega)x) \mid vW(\omega) \le q(\omega) \,] \}.$$

Thus an estimate $\xi^s$ of the gradient of $F_0$ at $x^s$ is given by

$$\xi^s = c - u^s T(\omega^s)$$

where $\omega^s$ is obtained by random sampling from $\Omega$ (using the measure $P$), and

$$u^s \in \operatorname{argmax}[\, v(h(\omega^s) - T(\omega^s)x^s) \mid vW(\omega^s) \le q(\omega^s) \,].$$

The iterates could then be obtained by

$$x^{s+1} := \operatorname{prj}_X [x^s + \rho_s u^s T(\omega^s) - \rho_s c]$$

where

$$X = \{ x \in R^n_+ \mid Ax \le b \}.$$

It is not difficult to show that under very weak regularity conditions (involving
the dependence of $W(\omega)$ on $\omega$),

$$E\{\xi^s \mid x^s\} \in \partial F_0(x^s).$$
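
As an illustration of this scheme, the sketch below treats a one-dimensional simple-recourse instance, chosen because the dual solution $u^s$ is then available in closed form and no linear programming solver is needed. All data are assumptions: $T = 1$, $h(\omega) = \omega$ uniform on $[0,100]$, $W = [1, -1]$, recourse costs `q_plus` and `q_minus`, first-stage cost `c`, $X = [0, 100]$, and step sizes $\rho_s = a/s$.

```python
import random


def recourse_dual(x, omega, q_plus, q_minus):
    """Optimal dual multiplier u of the recourse problem
         min  q_plus*y_plus + q_minus*y_minus
         s.t. y_plus - y_minus = omega - x,  y_plus, y_minus >= 0,
    i.e. a maximizer of v*(omega - x) subject to -q_minus <= v <= q_plus."""
    return q_plus if omega > x else -q_minus


def sqg_simple_recourse(c, q_plus, q_minus, upper, sample, iterations, rng, a=40.0):
    """Projected stochastic quasi-gradient iterates for the instance above."""
    x = 0.0
    for s in range(1, iterations + 1):
        rho = a / s                               # assumed step-size rule
        omega = sample(rng)
        u = recourse_dual(x, omega, q_plus, q_minus)
        xi = c - u * 1.0                          # xi^s = c - u^s T, with T = 1
        x = min(max(x - rho * xi, 0.0), upper)    # projection on X = [0, upper]
    return x


x_recourse = sqg_simple_recourse(
    c=1.0, q_plus=4.0, q_minus=1.0, upper=100.0,
    sample=lambda rng: rng.uniform(0.0, 100.0),
    iterations=50000, rng=random.Random(3))
```

For these data the optimality condition $c - q^+ P\{\omega > x\} + q^- P\{\omega \le x\} = 0$ gives $x^* = 60$, which the iterates approach without any expectation ever being evaluated.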


1.10 Conclusion 

By way of conclusion, let us just raise the following possibility. The stochastic
quasi-gradient method can operate by obtaining its stochastic quasi-gradient
from 1 sample of the subgradients of $f_0(\cdot,\omega)$ at $x^s$; it could equally well, if this
was viewed as advantageous, obtain its stochastic quasi-gradient $\xi^s$ by taking
a finite sample of the subgradients of $f_0(\cdot,\omega)$ at $x^s$, say $L$ of them. We would
then set

$$\xi^s := \frac{1}{L} \sum_{l=1}^{L} v^l \quad \text{where } v^l \in \partial f_0(x^s, \omega^l) \tag{1.62}$$

and $\omega^1, \dots, \omega^L$ are random samples (using the measure $P$). The question of
the efficiency of the method taking just 1 sample versus $L > 1$ should be, and
has been, raised, cf. the implementation of the methods described in Chapter
16. But this is not the question we have in mind. Returning to Section 1.8,
where we discussed approximation schemes, we nearly always ended up with an
approximate problem that involves a discretization of the probability measure
assigning probabilities $p_1, \dots, p_L$ to points $\omega^1, \dots, \omega^L$, and if a gradient-type
procedure was used to solve the approximating problem, the gradient, or a
subgradient of $F_0$ at $x^s$ would be obtained as

$$\xi^s = \sum_{l=1}^{L} p_l v^l \quad \text{where } v^l \in \partial f_0(x^s, \omega^l). \tag{1.63}$$

The similarity between expressions (1.62) and (1.63) suggests possibly a new
class of algorithms for solving stochastic optimization problems, one that relies
on an approximate probability measure (to be refined as the algorithm
progresses) to obtain its iterates, allowing for the possibility of a quasi-gradient
at each step without losing some of the inherent adaptive possibilities of the
quasi-gradient algorithm.
CHAPTER 2


APPROXIMATION TECHNIQUES IN STOCHASTIC 
PROGRAMMING 


P. Kall, A. Ruszczyński, and K. Frauendorfer


2.1 Introduction 


We start this section with a brief discussion of the basic difficulties encountered in
stochastic programming and an overview of the main approaches for overcoming them.
Next we describe the fundamental ideas of approximation techniques, which we
analyze in more detail in the next sections of this chapter.


2.1.1 The need to approximate stochastic programming problems 


The basic feature that distinguishes stochastic programming problems from other
optimization problems is the way in which the objective function or constraint
functions are defined. In stochastic programming problems values of some of 
these functions are numerical characteristics of random phenomena dependent 
on the decision variables. In particular, these can be 


(i) mathematical expectations of functions dependent on our decision variables 
and some random parameters, or 


(ii) probabilities of some random events which are controlled by the decision 
variables. 


This feature gives rise to the main difficulty encountered in stochastic program- 
ming problems: the difficulty of calculating values and gradients (or subgradi- 
ents) of the functions defining the problem. 

To discuss this matter in more detail, let us suppose that the objective
function $F(x)$ in a stochastic programming problem is defined as a mathematical
expectation of a function $f(x,\xi)$, where $x \in R^n$ is the vector of decision variables
and $\xi$ is an $m$-dimensional vector of random parameters. Formally, the objective
function can be expressed as follows:

$$F(x) = E f(x,\xi) = \int_\Omega f(x, \xi(\omega))\, P(d\omega), \tag{2.1}$$

where $\Omega$ denotes an abstract probability space and $P$ is the corresponding
probability measure. In a special case, if $\xi$ is a discrete random vector attaining
only a finite number of values $\xi^1, \xi^2, \dots, \xi^L$ with probabilities $p_1 > 0$, $p_2 > 0$,
$\dots$, $p_L > 0$, $\sum_{l=1}^{L} p_l = 1$, we can rewrite (2.1) as

$$F(x) = \sum_{l=1}^{L} p_l f(x, \xi^l). \tag{2.1a}$$

But in another special case, if the random vector $\xi = (\xi_1, \xi_2, \dots, \xi_m)$ has a
probability density function $\varphi(\xi_1, \xi_2, \dots, \xi_m)$, the general formula (2.1) takes
on the form of a Riemann integral

$$F(x) = \int\!\!\int \cdots \int f(x,\xi)\, \varphi(\xi)\, d\xi_1\, d\xi_2 \cdots d\xi_m. \tag{2.1b}$$


We see that to evaluate the objective function F at a given point z it is neces- 
sary to calculate a multiple integral with respect to the measure describing the 
distribution of €. If it is not possible to perform the integration analytically, 
we have to use numerical methods, which usually require much computational 
effort, which increases rapidly with the dimension of € and with the required 
accuracy. 

Straightforward application of common nonlinear programming methods
(see, e.g., [2], [16], [21]) to stochastic programming problems would require
calculation of integrals of the form (2.1) at each point $x^k$, $k = 0,1,2,\dots$,
generated by the optimization algorithm. Difficulties increase if the optimization
technique also needs gradients $\nabla F(x^k)$, $k = 0,1,2,\dots$, which in our case turn
out to be even more difficult to evaluate than the objective. Indeed, if the
function $f(x,\xi)$ in (2.1) is continuously differentiable with respect to $x$ for all $\xi$,
then, under reasonable additional conditions (cf., e.g. [31]) $F(x)$ is continuously
differentiable and

$$\nabla F(x) = \int_\Omega \nabla_x f(x, \xi(\omega))\, P(d\omega) \tag{2.2}$$

where $\nabla_x f(x,\xi)$ denotes the gradient of $f$ with respect to $x$. In the two special
cases considered above we obtain

$$\nabla F(x) = \sum_{l=1}^{L} p_l \nabla_x f(x, \xi^l) \tag{2.2a}$$

and

$$\nabla F(x) = \int\!\!\int \cdots \int \nabla_x f(x,\xi)\, \varphi(\xi)\, d\xi_1\, d\xi_2 \cdots d\xi_m, \tag{2.2b}$$


respectively. Since nonlinear programming methods usually need many itera- 
tions to reach a neighborhood of the solution, the total computational effort 
required may be beyond the cost that can be afforded. 

There are two main approaches which overcome the difficulties discussed 
above: approximation techniques and stochastic quasigradient methods. 

In approximation techniques we replace the original problem with a simpler
one by approximating the random vector $\xi$ by another random vector $\tilde{\xi}$ for
which integrals (2.1) are easy to handle. Typically, we choose $\tilde{\xi}$ to be a discrete
random vector and deal only with sums of the form (2.1a).

Stochastic quasigradient methods avoid computing integrals of the form (2.1)
altogether. The main idea of these methods is to make random steps in
directions calculated on the basis of some statistical information about the problem
gained at each step. Contrary to approximation techniques, they do not try
to get a global image of the properties of $F(x)$, but use random values $f(x,\xi^k)$
and corresponding gradients $\nabla_x f(x,\xi^k)$ (or subgradients in the nondifferentiable
case) calculated at some sampled realizations $\xi^k$ of $\xi$, $k = 0,1,2,\dots$. In such a
way a kind of self-learning method is constructed, in which each particular step
may be inefficient, but their large number exhibits general statistical properties
that imply convergence with probability one to a solution.

Stochastic quasigradient methods are discussed later in this volume, and 
from now on we shall concentrate on the approximation schemes. It is also 
worth mentioning here that recently, in [19], an attempt has been made to 
combine these two approaches. 


2.1.2 Fundamentals of approximation techniques 


When constructing approximations to stochastic programming problems we 
have to analyze the following mutually related questions. 

First, we have to find a proper way of replacing the original random
vector $\xi$ with a discrete one.

Secondly, we have to study the relations between the original problem and
the approximate problem and estimate the accuracy of approximation.

Thirdly, we need a method of improving the accuracy, if it is not sufficient,
by constructing a better approximation to $\xi$.

Before investigating these problems in detail, let us introduce some basic 
ideas and mathematical properties of this approach. 

Let $\Xi \subset R^m$ be the support of the random vector $\xi$ (i.e. the smallest closed
set in $R^m$ such that $P\{\xi \in \Xi\} = 1$) and let $S^L$ be a finite collection of subsets
$\Xi_l$, $l = 1,2,\dots,L$, of $\Xi$ satisfying the following conditions:

$$\bigcup_{l=1}^{L} \Xi_l = \Xi, \tag{2.3}$$

$$\Xi_i \cap \Xi_j = \emptyset \quad \text{for } i \ne j;\ i,j = 1,2,\dots,L. \tag{2.4}$$

We shall call $S^L$ a partition of $\Xi$.
For any partition we can rewrite integral (2.1) as follows

$$F(x) = \int_\Xi f(x,\xi)\, P(d\xi) = \sum_{l=1}^{L} \int_{\Xi_l} f(x,\xi)\, P(d\xi), \tag{2.5}$$

where we perform integration over the support $\Xi \subset R^m$ and use the description
of the distribution of $\xi$ in the space of its values.
In the particular case (2.1b), which is of special interest for us, (2.5) reads

$$F(x) = \sum_{l=1}^{L} \int\!\!\int \cdots \int_{\Xi_l} f(x,\xi)\, \varphi(\xi)\, d\xi_1\, d\xi_2 \cdots d\xi_m. \tag{2.5a}$$

Proceeding as in the simplest method for calculating integrals we can now
approximate each integral over $\Xi_l$ as follows

$$\int_{\Xi_l} f(x,\xi)\, P(d\xi) \approx f(x,\xi^l) \int_{\Xi_l} P(d\xi) = f(x,\xi^l)\, P\{\xi \in \Xi_l\}, \tag{2.6}$$

where $\xi^l$ is a selected representative of the subset $\Xi_l$. In other words, we
approximate the function $f(x,\xi)$ by a step function in $\xi$, which is constant in
each set $\Xi_l$, $l = 1,2,\dots,L$. In this way we arrive at the following approximation
of $F(x)$:

$$F^L(x) = \sum_{l=1}^{L} p_l f(x,\xi^l), \tag{2.7}$$

with

$$p_l = P\{\xi \in \Xi_l\}.$$

Since by (2.3) and (2.4) we also have $\sum_{l=1}^{L} p_l = 1$, our approximation can be
equivalently interpreted as an approximation of $\xi$ by a discrete random vector $\tilde{\xi}$
attaining values $\xi^l$ with probabilities $p_l$, $l = 1,2,\dots,L$, and our approximating
formula (2.7) is exactly of the form (2.1a).

Generally, if the support $\Xi$ is bounded and if $\max_{1 \le l \le L} P\{\xi \in \Xi_l\} \to 0$
as $L \to \infty$, then for each $x$, under reasonable assumptions on $f(x,\xi)$ we get
a pointwise convergence of function values: $F^L(x) \to F(x)$ as $L \to \infty$. This
fundamental and highly desirable property, however, is not sufficient for us,
because we are rather interested in the convergence of the sequence of solutions
$\hat{x}_L$ of approximate problems, or at least of its convergent subsequences to a
solution of the original optimization problem. Some additional conditions, e.g.
compactness of the feasible set for $x$ together with the uniform convergence of
$F^L$ to $F$ and continuity of $F$, are needed to ensure such a kind of convergence.
We shall not go further into the analysis of these theoretical problems; a
thorough discussion of them and various generalizations can be found in [1], [15],
[30], [34]. Still, in many practical problems such conditions are satisfied. It is
also often the case that in practice a point $\hat{x}$ is satisfactory, for which the
objective value lies within a certain tolerance range with respect to the minimum
value, and this is possible to achieve for a far broader class of problems.

Nevertheless, it is still very difficult to determine in advance how fine the
partition should be to ensure the accuracy of approximation. Division of $\Xi$ into
many small pieces $\Xi_l$, $l = 1,2,\dots,L$, without any strategy may dramatically
increase the computational complexity of the approximate problem. To
illustrate the difficulties that may arise, let us suppose that there are 10 independent
scalar random variables in our original problem, so that $\xi = (\xi_1, \xi_2, \dots, \xi_{10})$.
If the support of each $\xi_i$, $i = 1,2,\dots,10$, is divided into 10 subintervals, we
get $10^{10}$ subsets $\Xi_l$ of the support $\Xi$ of $\xi$, a number which is clearly beyond any
computational capabilities.

To avoid such excessive numbers of subsets $\Xi_l$ we have to use nonuniform
partitions which are suited to properties of $f(x,\xi)$ as a function of $\xi$. The
problem of constructing such partitions is closely related to the way of choosing
points $\xi^l \in \Xi_l$. Considering only convergence, these can be arbitrary points;
however, if we choose them more carefully, namely as conditional expectations

$$\xi^l = E\{\xi(\omega) \mid \xi(\omega) \in \Xi_l\} \tag{2.8}$$

with probabilities

$$p_l = P\{\xi(\omega) \in \Xi_l\} \tag{2.9}$$

then we shall not only improve the accuracy of approximation in many cases,
but also gain information that will help us to properly refine the partitioning if
the accuracy is not sufficient.

Indeed, if the function $f(x,\xi)$ is linear with respect to $\xi$ in the set $\Xi_l$, then
with $\xi^l$ defined by (2.8) we obtain strict equality in (2.6),

$$\int_{\Xi_l} f(x,\xi)\, P(d\xi) = f(x,\xi^l)\, P\{\xi \in \Xi_l\}. \tag{2.10}$$

This implies that further division of the subset $\Xi_l$ is useless for improving
the accuracy of approximation at a given $x$. On the other hand, if $f(x,\cdot)$ is
highly nonlinear in $\Xi_l$, the approximation in $\Xi_l$ can be rather rough and a
finer partition of $\Xi_l$ is desirable. Hence, the density of partitioning in various
subregions of the support $\Xi$ should be related to the nonlinearity of $f(x,\cdot)$.
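
A small numerical check of these two observations, under assumed data: $\xi$ is taken to be uniformly distributed on $[0,1]$ and partitioned into $L$ equal cells, so that the conditional expectations (2.8) are the cell midpoints and $p_l = 1/L$. The two integrands are hypothetical examples with $x$ held fixed.

```python
def partition_approximation(f, L):
    """F^L of (2.7) for xi uniform on [0, 1] and L equal cells:
    representatives are the conditional means (cell midpoints), p_l = 1/L."""
    return sum(f((l + 0.5) / L) for l in range(L)) / L


linear = lambda xi: 3.0 * xi + 1.0      # exact expectation: 2.5
quadratic = lambda xi: xi ** 2          # exact expectation: 1/3

sizes = (1, 4, 16)
linear_errors = [abs(partition_approximation(linear, L) - 2.5) for L in sizes]
quadratic_errors = [1.0 / 3.0 - partition_approximation(quadratic, L) for L in sizes]
```

The linear integrand is reproduced exactly (up to rounding) by every partition, in line with (2.10), whereas for the quadratic integrand the error equals $1/(12L^2)$ and only decreases as the cells are refined.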

Generally, we do not know in advance such detailed properties of the
function $f(x,\xi)$; some information can be gained only in the course of solving a
definite approximation problem. Furthermore, the properties of the function
$f(x,\cdot)$ change when $x$ changes, and we are interested in having a good partition
for $x$ close to the solution of our problem.

Thus we arrive at an idea of a sequential approximation method in which
constructing a partition of $\Xi$ and approximating a solution to the original
problem are mutually related:


(1) Choose an initial partitioning $\Xi_l$, $l = 1,2,\dots,L$, which satisfies (2.3) and
(2.4).

(2) Choose points $\xi^l \in \Xi_l$ and probabilities $p_l$, $l = 1,2,\dots,L$, according to
(2.8) and (2.9).

(3) Solve the approximate problem.

(4) At the solution $\hat{x}_L$ analyze the accuracy of approximation by investigating
properties of the function $f(x,\xi)$ in each of the subsets $\Xi_l$, $l = 1,2,\dots,L$,
choose those of them that should be further divided, if the accuracy is not
sufficient, and repeat step 2.
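
Steps (1)-(4) can be sketched for a toy problem as follows. Everything specific here is an assumption made for illustration: $\xi$ is uniform on $[0,1]$, the integrand is $f(x,\xi) = c\,x + q \max(\xi - x, 0)$ (convex in $\xi$), the approximate problem is solved by a crude grid search, and the accuracy test in step (4) compares the lower bound from conditional means with an upper bound obtained by placing each cell's mass on its two end points.

```python
def f(x, xi, c=1.0, q=4.0):
    """Hypothetical integrand: first-stage cost plus a shortage penalty."""
    return c * x + q * max(xi - x, 0.0)


def cell_bounds(x, a, b):
    """Contribution of the cell [a, b] to a lower and an upper bound of F(x)."""
    p = b - a                                   # P{xi in [a, b]}, as in (2.9)
    lower = p * f(x, 0.5 * (a + b))             # conditional mean (2.8)
    upper = p * 0.5 * (f(x, a) + f(x, b))       # mass on the two end points
    return lower, upper


def solve_approximate(cells, grid):
    """Step (3): minimize the approximation F^L over a grid of decisions."""
    return min(grid, key=lambda x: sum(cell_bounds(x, a, b)[0] for a, b in cells))


def sequential_approximation(tol=1e-3, max_cells=200):
    cells = [(0.0, 1.0)]                        # step (1)
    grid = [k / 200 for k in range(201)]
    while True:
        x_hat = solve_approximate(cells, grid)  # steps (2) and (3)
        bounds = [cell_bounds(x_hat, a, b) for a, b in cells]
        lower = sum(lo for lo, _ in bounds)
        upper = sum(up for _, up in bounds)
        if upper - lower <= tol or len(cells) >= max_cells:
            return x_hat, lower, upper, cells
        # step (4): split the cell where f(x_hat, .) is least well captured
        worst = max(range(len(cells)), key=lambda k: bounds[k][1] - bounds[k][0])
        a, b = cells.pop(worst)
        mid = 0.5 * (a + b)
        cells += [(a, mid), (mid, b)]


x_hat, lower, upper, cells = sequential_approximation()
```

For these data $F(x) = x + 2(1 - x)^2$, with minimum value 0.875 at $x = 0.75$. Only the cells containing the kink of $f(\hat{x}, \cdot)$ are ever split, so the final partition is fine near $\xi = 0.75$ and coarse elsewhere.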


Detailed realization of this procedure depends upon properties of the class 
of problems to which it is applied. In the next section we shall describe in more 
detail its application to a certain important class of stochastic programming 
problems. 


2.2 Approximation Schemes for Linear Two-stage Problems of Sto- 
chastic Programming 

In this section we consider a special class of stochastic programming problems, 
so-called two-stage problems, and we describe the realization of the sequential 
approximation method in this case. In 2.2.1 we formulate the problem and 
review its basic properties and in Section 2.2.2 we consider the special case with 
a discretely distributed random vector. Section 2.2.3 is devoted to estimates 
of the accuracy of approximation, which are followed in 2.2.4 by the analysis 
of refining strategies. The special case of so-called simple recourse is discussed 
separately in 2.2.5. 


2.2.1 Basic properties of linear two-stage problems
The linear two-stage problem of stochastic programming is defined as follows:

$$\begin{aligned}
\text{minimize}\quad & \Big[\Phi(x) = c^T x + \int_\Omega Q(x,\xi(\omega))\,P(d\omega)\Big] \\
\text{subject to}\quad & Ax = b, \\
& x \ge 0,
\end{aligned} \tag{2.11}$$

where $c \in R^{n_1}$, $b \in R^{m_1}$ and $A$ of dimension $m_1 \times n_1$ are defined as in a
common linear programming problem. The function $Q(x,\xi(\omega))$ that appears
in the additional part of the objective in (2.11) is defined as the optimal value of
another linear programming problem which has $x$ as a parameter and involves
random coefficients $\xi(\omega) = (q(\omega), h(\omega), T(\omega))$:

$$\begin{aligned}
\text{minimize}\quad & q^T(\omega)\, y \\
\text{subject to}\quad & Wy = h(\omega) - T(\omega)x, \\
& y \ge 0.
\end{aligned} \tag{2.12}$$

The linear programming problem (2.12) is called the second stage problem, or
the recourse problem; it consists in finding the best recourse decision $y \in R^{n_2}$,
when the first stage decision $x \in R^{n_1}$ and random realization of the parameters
$q(\omega) \in R^{n_2}$, $h(\omega) \in R^{m_2}$ and $T(\omega)$ of dimension $m_2 \times n_1$ are already established.
The $m_2 \times n_2$ matrix $W$ is deterministic.

Since the expected value of the minimum recourse cost $Q(x,\xi(\omega))$ modifies
the objective of the first-stage problem (2.11), the whole model (2.11)-(2.12)
has a certain internal dynamical structure: when looking for an optimal first
stage decision $x$ we have to take into account not only the direct first stage
cost $c^T x$ but also the expected value of the future recourse cost. If there is no
feasible solution to (2.12) we assume $Q(x,\xi(\omega)) = +\infty$, and this should also be
considered at the first stage.

We are especially interested in stochastic programming problems with
recourse because of their wide application to modeling decision problems which
involve random data. If some constraints, e.g. $Tx = h$, in a linear programming
problem include random coefficients in $T$ or $h$ and we have to take the
decision before knowing the realizations $T(\omega)$ and $h(\omega)$ of $T$ and $h$, it is generally
impossible to require that the equality

$$T(\omega)x = h(\omega) \tag{2.13}$$

be satisfied for each realization of the stochastic constraint parameters. The
problem with recourse is a way of overcoming these modeling difficulties; the
recourse decision $y$ may be interpreted as a correction in (2.13), and the recourse
cost $Q(x,\xi(\omega))$ as a penalty for discrepancy in (2.13).

In a more general model the matrix W in (2.12) could be random too, but 
for the ease of exposition we assume that it is deterministic; such a model is 
called the problem with fixed recourse. Most of the theory and computational 
methods have been developed for this class of linear two-stage problems. 

Let us review briefly basic properties of the problem (2.11)-(2.12). The
feasible set of (2.11) is the intersection of the set given by the first stage
constraints

$$K_1 = \{ x \in R^{n_1} : Ax = b,\ x \ge 0 \} \tag{2.14}$$

and of the induced feasible set

$$K_2 = \{ x \in R^{n_1} : Q(x,\xi(\omega)) < \infty \text{ with probability } 1 \}. \tag{2.15}$$

While $K_1$ is described explicitly and easy to handle, the induced set $K_2$ is
defined implicitly and hard to express analytically. However, if the matrix $W$
in (2.12) is such that $\{Wy : y \ge 0\} = R^{m_2}$ (i.e. the corrections $Wy$ in (2.12)
can cancel any error), we have $K_2 = R^{n_1}$. Problems with such a property are
called problems with complete recourse. In the special case of $W = [I, -I]$ we
speak about simple recourse. Although generally the induced feasible set $K_2$
need not contain $K_1$ we still have the following property.


(a) The sets $K_1$, $K_2$ and $K = K_1 \cap K_2$ are convex and closed.


As far as the recourse cost $Q(x,\xi(\omega))$ is concerned, many interesting
theoretical results are available. First, by the theory of duality in linear
programming we know that $Q(x,\xi(\omega)) > -\infty$ (i.e. the second stage problem is bounded
from below) if and only if one can find $u \in R^{m_2}$ such that $W^T u \le q(\omega)$. Since
the case of unboundedness is of no interest for us, we shall from now on assume
that the above condition is satisfied for each realization of the random
vector $q(\omega)$. Under this assumption the recourse function possesses the following
properties.

(b) For any fixed $x \in K$ and any $q$ the function $(h,T) \mapsto Q(x, \xi = (q,h,T))$ is
piecewise linear and convex.


(c) For any fixed $x \in K$ and any $h$ and $T$ the function $q \mapsto Q(x, \xi = (q,h,T))$
is piecewise linear and concave.


(d) For any fixed $\xi = (q,h,T)$ the function $x \mapsto Q(x,\xi)$ is a convex piecewise
linear function on $K$.

Under the additional condition that the random variable $\xi(\omega) = (q(\omega), h(\omega),
T(\omega))$ has finite second moments we finally obtain the following result.

(e) The function $Q(x) = \int_\Omega Q(x,\xi(\omega))\,P(d\omega)$ is finite and convex in $K$.

A detailed discussion of properties of linear two-stage stochastic
programming problems can be found in [12] and [85].

Properties (a)-(e) are of fundamental importance for the concepts and
methods discussed in this chapter and will be frequently used in subsequent
sections. We also assume that we deal with the case of complete recourse (no
induced constraints). Motivation for the latter assumption is rather obvious:
with $K_2 \ne R^{n_1}$ it would be extremely difficult to ensure that solutions to
approximate problems are in the induced feasible set of the original problem.


2.2.2 The two-stage problem with a discrete random vector 


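Let us consider in more detail properties of the stochastic programming problem
with recourse in case of a discretely distributed random vector $\xi$ attaining
values:

$$\begin{aligned}
\xi^1 &= (q^1, h^1, T^1) \text{ with probability } p_1 > 0, \\
\xi^2 &= (q^2, h^2, T^2) \text{ with probability } p_2 > 0, \\
&\ \ \vdots \\
\xi^L &= (q^L, h^L, T^L) \text{ with probability } p_L > 0,
\end{aligned} \tag{2.16}$$

where

$$\sum_{l=1}^{L} p_l = 1. \tag{2.17}$$

In this case the two-stage problem (2.11)-(2.12) takes on the form

$$\begin{aligned}
\text{minimize}\quad & \Big[\Phi(x) = c^T x + \sum_{l=1}^{L} p_l\, Q(x,\xi^l)\Big] \\
\text{subject to}\quad & Ax = b, \\
& x \ge 0,
\end{aligned} \tag{2.18}$$

where $Q(x,\xi^l)$ is the minimum objective value in the recourse problem

$$\begin{aligned}
\text{minimize}\quad & (q^l)^T y \\
\text{subject to}\quad & Wy = h^l - T^l x, \\
& y \ge 0,
\end{aligned} \tag{2.19}$$
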
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€=1,2,...,L. If we denote by 9°(z), 2=1,2,...,L, the solutions to problems 
(2.19) at a given x, we can express the first stage objective as 


B(2) =c72 + > pela’) H(2). (2.20) 
é=1 


Of course, the solutions 9°(z) depend on gz in a rather involved way, so that the 
products (q¢)" 94(2) are piecewise linear (cf. property (d) in 2.2.1). However, 
instead of considering (2.18)—(2.19) as a two-level problem, we can put together 
the first stage problem (2.18) and all realizations of the second stage problem 
(2.19) into a large linear programming model: 


ee ° T Tr a 
minimize c”x+pe(q')° y! +.p9(9*) y? + pig’) y” 
subject to 


Ag = b 
oN oa 2,21 
Ts +Wy? = yp 
T 2 +W y! = pl 

2 >0, yi>oy?>0 ... y2 DO. 


Problems (2.18)-(2.19) and (2.21) are equivalent in the sense that they have 
the same set of solutions, as the first stage decision vector z is concerned, and 
the optimal values of y!,y?,...,y” in (2.21) are solutions to the realizations of 
the second stage problem (2.19) at the optimal z. 

Summing up, a two-stage problem with a discretely distributed random
vector $\xi$ turns out to be equivalent to a large-scale linear programming
problem, which can be solved by powerful linear programming techniques that
take account of its special dual block angular structure. These techniques are
discussed in detail in Chapter 5 of this volume (see also [13], [28] and [33]).
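
To make the dual block angular structure of (2.21) concrete, the following sketch assembles the cost vector, constraint matrix and right-hand side of the extensive form from first-stage data and a list of scenarios. The function name `extensive_form` and the plain list-of-lists layout are assumptions made for illustration only; the arrays can be handed to any linear programming code.

```python
def extensive_form(c, A, b, W, scenarios):
    """Assemble the data of the large LP (2.21).

    scenarios is a list of tuples (p, q, h, T), one per realization.
    Columns are ordered as (x, y^1, ..., y^L). Hypothetical helper.
    """
    n1 = len(c)            # number of first-stage variables
    n2 = len(W[0])         # number of second-stage variables per scenario
    L = len(scenarios)

    # Objective: c^T x + sum_l p_l (q^l)^T y^l
    cost = list(c)
    for p, q, _h, _T in scenarios:
        cost.extend(p * q_i for q_i in q)

    rows, rhs = [], []
    # First-stage block: A x = b
    for a_row, b_i in zip(A, b):
        rows.append(list(a_row) + [0.0] * (L * n2))
        rhs.append(b_i)
    # One block per scenario: T^l x + W y^l = h^l
    for ell, (_p, _q, h, T) in enumerate(scenarios):
        start = n1 + ell * n2
        for t_row, w_row, h_i in zip(T, W, h):
            row = list(t_row) + [0.0] * (L * n2)
            row[start:start + n2] = list(w_row)
            rows.append(row)
            rhs.append(h_i)
    return cost, rows, rhs


cost, rows, rhs = extensive_form(
    c=[1.0, 2.0],
    A=[[1.0, 1.0]],
    b=[1.0],
    W=[[1.0, -1.0]],
    scenarios=[
        (0.5, [3.0, 1.0], [2.0], [[1.0, 0.0]]),
        (0.5, [3.0, 1.0], [4.0], [[0.0, 1.0]]),
    ],
)
```

Each scenario contributes one block of rows that shares only the $x$ columns with the other blocks, which is exactly the structure exploited by decomposition methods.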

2.2.3 Error estimates

Let us now investigate relations between a two-stage problem with an arbitrary
distribution of the random parameter $\xi$ and its approximation resulting from
the discretization of $\xi$. Recall that, according to the ideas sketched in Section
2.1.2, the discretely distributed approximation $\tilde{\xi}$ to $\xi$ is constructed for
a given partition $S^L = (\Xi_1, \Xi_2, \dots, \Xi_L)$ of the support $\Xi$ of $\xi$ as follows:

$$
P\{\tilde{\xi} = \xi^\ell\} = p_\ell, \qquad \ell = 1,2,\dots,L, \tag{2.22}
$$

where $\xi^1, \xi^2, \dots, \xi^L$ are conditional expectations of $\xi$ in $\Xi_\ell$,

$$
\xi^\ell = E\{\xi \mid \xi \in \Xi_\ell\}, \qquad \ell = 1,2,\dots,L, \tag{2.23}
$$

and

$$
p_\ell = P\{\xi \in \Xi_\ell\}, \qquad \ell = 1,2,\dots,L, \tag{2.24}
$$

$$
\sum_{\ell=1}^{L} p_\ell = 1. \tag{2.25}
$$

We expect (2.23) to be a good choice, since the conditional expectations
minimize $E\|\xi - \tilde{\xi}\|^2$ with respect to all discrete distributions
corresponding to our partition [14].

After replacing $\xi$ in (2.11)-(2.12) by the discrete variable $\tilde{\xi}$ we obtain
an approximating problem of the form (2.21). Obviously, this problem is much
easier to solve than the original one, but now we need estimates of the errors
caused by the approximation. Such estimates can be derived from the general
properties (a)-(d) of two-stage problems, discussed in Section 2.2.1.


Lower Bounds 


Let us assume that all the subsets $\Xi_\ell$, $\ell = 1,2,\dots,L$, are convex and the
function $Q(x, \xi)$ in (2.11) is convex in $\xi$ for each $x$. By property (b), the
latter condition is satisfied if $q$ in (2.12) is deterministic, and only $T(\omega)$
and $h(\omega)$ vary randomly.

Under this assumption, with $\xi^\ell$ and $p_\ell$ representing the conditional
expectations and probabilities defined by (2.23)-(2.24), for each block $\Xi_\ell$
from Jensen's inequality (see [14]) we obtain

$$
\int_{\Xi_\ell} Q(x, \xi(\omega))\, P(d\omega) \ge p_\ell\, Q(x, \xi^\ell), \qquad \ell = 1,2,\dots,L. \tag{2.26}
$$

Thus for any $x$ we have

$$
\psi(x) = c^T x + \int Q(x, \xi(\omega))\, P(d\omega)
\ge c^T x + \sum_{\ell=1}^{L} p_\ell\, Q(x, \xi^\ell) = \underline{\psi}(x). \tag{2.27}
$$

Hence, the objective value $\underline{\psi}(x)$ in the approximate problem of the form
(2.18) is a lower bound for the true objective value at a given $x$. Furthermore,
in (2.18), or its extended LP form (2.21), we minimize $\underline{\psi}(x)$ and therefore
the minimal value $\underline{\psi}(\hat{x})$, where $\hat{x}$ solves (2.21), is a lower bound
for the least value of the true objective:

$$
\psi(x) \ge \underline{\psi}(\hat{x}) \quad \text{for all feasible } x. \tag{2.28}
$$

Another important feature of Jensen's lower bound is its monotonicity:
if $S^{\nu+1}$ is a refinement of the partition $S^{\nu}$ (i.e. results from $S^{\nu}$ by
division of some of its members), then a lower bound obtained for $S^{\nu+1}$ is at
least as good as the previous one (see [8]).

A more thorough discussion of applications of Jensen’s inequality in sto- 
chastic programming can be found in [$], [8], [9], and [14]. 
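
The monotonicity of the Jensen bound can be checked numerically on a small example. The sketch below is not taken from the text: it assumes a scalar random right-hand side $h$ uniformly distributed on $[0,1]$, a fixed tender $\chi = 0.3$, and a convex piecewise linear recourse cost with illustrative slopes $q^+ = 2$, $q^- = 1$. For the uniform law the conditional expectation in each cell is its midpoint.

```python
def recourse(chi, h, q_plus=2.0, q_minus=1.0):
    """Convex, piecewise linear recourse cost (illustrative data)."""
    return q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0)


def jensen_lower_bound(chi, a, b, cells):
    """Sum of p_l * Q(chi, xi^l) over `cells` equal subintervals of [a, b],
    assuming h is uniformly distributed on [a, b]."""
    width = (b - a) / cells
    p = 1.0 / cells                         # probability of each cell
    total = 0.0
    for k in range(cells):
        cond_mean = a + (k + 0.5) * width   # conditional expectation in cell k
        total += p * recourse(chi, cond_mean)
    return total


# Exact value of E Q(0.3, h): 2 * 0.49 / 2 + 1 * 0.09 / 2 = 0.535
EXACT = 0.535
bounds = [jensen_lower_bound(0.3, 0.0, 1.0, n) for n in (1, 2, 4, 8)]
```

Because each partition in the sequence refines the previous one, the entries of `bounds` never decrease and never exceed `EXACT`.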

One can also exploit the convexity of the function $Q(x,\cdot)$ by approximating
it from below by a piecewise linear function ($Q(x,\cdot)$ is piecewise linear itself,
but may contain a very large number of pieces).

By the duality theory in linear programming

$$
\begin{aligned}
Q(x, \xi(\omega)) &= \min\{\, q^T y \mid Wy = h(\omega) - T(\omega)x,\; y \ge 0 \,\}\\
&= \max\{\, (h(\omega) - T(\omega)x)^T u \mid W^T u \le q \,\},
\end{aligned}
\tag{2.29}
$$

where $u \in R^{m_2}$ is the vector of multipliers in (2.12). If $u^\ell$,
$\ell = 1,2,\dots,L$, are some feasible solutions to the dual program, then

$$
Q(x, \xi(\omega)) \ge \max_{1 \le \ell \le L} (h(\omega) - T(\omega)x)^T u^\ell = \tilde{Q}(x, \xi(\omega)). \tag{2.30}
$$

For a deterministic $q$ the feasible set $W^T u \le q$ in the dual problem (2.29) does
not depend on $\omega$, hence we can substitute for $u^\ell$ dual solutions to the second
stage problem at any $x$ and with any $\xi(\omega) = (T(\omega), h(\omega))$. In particular, if we
choose $u^\ell$ to be optimal multiplier vectors at $\xi^\ell$ for a fixed $x$, then the graph
of the linear function $Q_\ell(\xi(\omega)) = (h(\omega) - T(\omega)x)^T u^\ell$ will support the
graph of $Q(x,\cdot)$ at $\xi^\ell$. Finally, taking the expectation of both sides of (2.30)
we obtain a lower bound for $\psi(x)$:

$$
\psi(x) \ge c^T x + E\Big\{ \max_{1 \le \ell \le L} (h(\omega) - T(\omega)x)^T u^\ell \Big\} = \tilde{\psi}(x). \tag{2.31}
$$

The two methods for calculating lower bounds are illustrated in Figure 2.1
and Figure 2.2. We see from these figures that the lower bound (2.27) results
from approximating the function $Q(x,\cdot)$ by a step function attaining in
$\Xi_\ell$ the values $Q(x, \xi^\ell)$, $\ell = 1,2,\dots,L$, while the lower bound (2.31)
results from approximating $Q(x,\cdot)$ by a convex piecewise linear function
$\tilde{Q}(x,\cdot)$ defined by supporting hyperplanes at $\xi^\ell$. The second approximation
can be more accurate and the resulting bound sharper at a given $x$, but the
evaluation of (2.31) requires an additional integration of the approximating
piecewise linear function $\tilde{Q}(x, \xi(\omega))$. Another difference is that Jensen's
inequality is in some way consistent with the approximating problem (2.21) and
provides a lower bound (2.28) for the minimum objective value, which in general
is not true for (2.31) (to get a global lower bound one would have to minimize
the right-hand side of (2.31) instead of solving (2.21)). An extensive discussion
of the above techniques for constructing lower bounds can be found in [3], [8],
and [14].


Figure 2.1 Lower bound by Jensen's inequality

Figure 2.2 Lower bound by piecewise linear approximation


Upper bounds 


Since in general we are not able to evaluate $\psi(x)$ exactly, we also need upper
bounds on the objective value to compare them with our estimates of the
minimum. Such bounds can be obtained from the Edmundson-Madansky inequality
for expectations of convex functions.

To explain the main idea of constructing an upper bound, let us assume
that $\xi$ is a one-dimensional random variable with support $\Xi = [a,b]$. Define
now $\bar{\xi}$ to be a discrete random variable attaining the values

$$
\begin{aligned}
a \quad &\text{with probability } p_a = \frac{b - \xi^0}{b - a},\\
b \quad &\text{with probability } p_b = \frac{\xi^0 - a}{b - a},
\end{aligned}
\tag{2.32}
$$

where $\xi^0 = E\xi = \int_a^b \xi\, P(d\xi)$. The Edmundson-Madansky inequality, when
applied to our problem, says that

$$
E\,Q(x, \xi) \le E\,Q(x, \bar{\xi}), \tag{2.33}
$$

provided that $Q(x,\cdot)$ is convex. Indeed, the convexity of $Q(x,\cdot)$ implies that
$(b - a)\,Q(x, \xi) \le (b - \xi)\,Q(x, a) + (\xi - a)\,Q(x, b)$ for each $\xi \in [a,b]$, hence

$$
\begin{aligned}
E\,Q(x, \xi) &= \int_a^b Q(x, \xi)\, P(d\xi)
\le \int_a^b \Big[ \frac{b - \xi}{b - a}\, Q(x, a) + \frac{\xi - a}{b - a}\, Q(x, b) \Big] P(d\xi)\\
&= \frac{b - \xi^0}{b - a}\, Q(x, a) + \frac{\xi^0 - a}{b - a}\, Q(x, b) = E\,Q(x, \bar{\xi}).
\end{aligned}
\tag{2.34}
$$

If $\xi$ is an $m$-dimensional random variable with independent components
distributed in intervals $[a_j, b_j]$ with expectations $\xi_j^0$, $j = 1,2,\dots,m$,
inequality (2.33) holds with a variable $\bar{\xi}$ having independent components
$\bar{\xi}_j$ distributed in points $a_j$ and $b_j$ according to (2.32) (see [8], [9], [82]).
The variable $\bar{\xi}$ constructed in this way is a discrete random variable
attaining values only at vertices of the rectangle $\times_{j=1}^{m} [a_j, b_j]$.

The distribution of $\bar{\xi}$ may be viewed as an extremal distribution in the
following sense: among all distributions with support $\Xi$ and the same
expectation $\xi^0 = (\xi_1^0, \xi_2^0, \dots, \xi_m^0)$, for any convex function
$g : \Xi \to R^1$ the distribution of $\bar{\xi}$ provides the maximum of the expected
value of $g$ (cf. [6], [86], [82]). This property explains the essence of the upper
bound (2.33) and can also be used for constructing worst-case approximations to
stochastic programming problems (see Section 2.4).

Let us now consider the partition of $\Xi$ into rectangles

$$
\Xi_\ell = \times_{j=1}^{m} [a_j^\ell, b_j^\ell], \qquad \ell = 1,2,\dots,L. \tag{2.35}
$$

Obviously,

$$
E\,Q(x, \xi) = \sum_{\ell=1}^{L} \int_{\Xi_\ell} Q(x, \xi)\, P(d\xi). \tag{2.36}
$$

Each of the integrals in (2.36) can be estimated from above according to the
Edmundson-Madansky inequality, with the expected value of $\xi$ replaced by the
conditional expectation $\xi^\ell$ of $\xi$ in $\Xi_\ell$. This yields the upper bound

$$
E\,Q(x, \xi) \le \sum_{\ell=1}^{L} p_\ell\, E\,Q(x, \bar{\xi}^\ell), \tag{2.37}
$$

where each $\bar{\xi}^\ell$ is defined for the corresponding subset $\Xi_\ell$ according to
(2.32), with $\xi^0$ replaced by the conditional expectations
$\xi_j^\ell = E\{\xi_j \mid \xi_j \in [a_j^\ell, b_j^\ell]\}$, $j = 1,2,\dots,m$. Two equivalent
interpretations of this procedure in a one-dimensional case are illustrated in
Figure 2.3 and Figure 2.4, while in Figure 2.5 we show how an upper bound is
constructed for a given $\Xi_\ell$ in a two-dimensional case.
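
A one-dimensional numerical sketch of the bound (2.37) may be helpful here. The data are assumptions chosen for illustration: $h$ uniform on $[0,1]$, $\chi = 0.3$, and a convex recourse cost with slopes $q^+ = 2$, $q^- = 1$, whose exact expectation is 0.535. In each cell the two endpoint weights follow (2.32) with the conditional mean in place of $\xi^0$.

```python
def recourse(chi, h, q_plus=2.0, q_minus=1.0):
    """Convex, piecewise linear recourse cost (illustrative data)."""
    return q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0)


def em_upper_bound(chi, edges, cond_means, probs):
    """Edmundson-Madansky bound summed over the cells of a 1-D partition."""
    total = 0.0
    for lo, hi, mean, p in zip(edges[:-1], edges[1:], cond_means, probs):
        w_lo = (hi - mean) / (hi - lo)   # weight of the left endpoint
        w_hi = (mean - lo) / (hi - lo)   # weight of the right endpoint
        total += p * (w_lo * recourse(chi, lo) + w_hi * recourse(chi, hi))
    return total


def uniform_partition(a, b, cells):
    """Edges, conditional means and probabilities for a uniform law on [a, b]."""
    edges = [a + (b - a) * k / cells for k in range(cells + 1)]
    means = [0.5 * (lo + hi) for lo, hi in zip(edges[:-1], edges[1:])]
    probs = [1.0 / cells] * cells
    return edges, means, probs


EXACT = 0.535
upper = [em_upper_bound(0.3, *uniform_partition(0.0, 1.0, n)) for n in (1, 2, 4, 8)]
```

On this nested sequence of partitions the entries of `upper` decrease towards `EXACT` from above.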
Figure 2.3 Upper bound by Edmundson-Madansky inequality


One can use inequality (2.37) in two ways. First, directly from (2.37) we
obtain an upper bound for the value of the objective at any given point $x$:

$$
\psi(x) = c^T x + E\,Q(x, \xi) \le c^T x + \sum_{\ell=1}^{L} p_\ell\, E\,Q(x, \bar{\xi}^\ell) = \bar{\psi}(x). \tag{2.38}
$$

We usually calculate this upper bound at the solution $\hat{x}$ of (2.21), by solving
the second stage problem at $\hat{x}$ and at each vertex $\xi_\ell^\nu$, $\ell = 1,2,\dots,L$,
$\nu = 1,2,\dots,2^m$, of our partition (note that most of the vertices are common
to many subsets).

Secondly, we can estimate from above the minimum value of $\psi(x)$ by finding
a point $\bar{x}$ which solves the problem

$$
\begin{aligned}
\text{minimize}\quad & \Big[\bar{\psi}(x) = c^T x + \sum_{\ell=1}^{L} p_\ell\, E\,Q(x, \bar{\xi}^\ell)\Big]\\
\text{subject to}\quad & Ax = b,\\
& x \ge 0.
\end{aligned}
\tag{2.39}
$$

From (2.38) we get

$$
\min_x \psi(x) \le \psi(\bar{x}) \le \bar{\psi}(\bar{x}). \tag{2.40}
$$
Figure 2.4 Equivalent interpretation of the upper bound in one-dimensional
case


Problem (2.39) can be equivalently formulated as a large-scale linear
programming problem of the same structure as (2.21). Indeed,

$$
E\,Q(x, \bar{\xi}^\ell) = \sum_{\nu=1}^{2^m} p_\ell^\nu\, Q(x, \xi_\ell^\nu), \tag{2.41}
$$

where $\xi_\ell^\nu$ are vertices of the subsets $\Xi_\ell$, and the probabilities $p_\ell^\nu$ are
defined as follows:

$$
p_\ell^\nu = P\{\bar{\xi}^\ell = \xi_\ell^\nu\} = \prod_{j=1}^{m} P\{\bar{\xi}_j^\ell = \xi_{\ell j}^\nu\}. \tag{2.42}
$$

Each of the factors $p_{\ell j}^\nu = P\{\bar{\xi}_j^\ell = \xi_{\ell j}^\nu\}$ is defined as in (2.32) with
$a$, $b$ and $\xi^0$ replaced by $a_j^\ell$, $b_j^\ell$ and the conditional expectation $\xi_j^\ell$
of $\xi_j$ in $[a_j^\ell, b_j^\ell]$. The number of blocks in the resulting linear
programming problem will be equal to the number of vertices of our partition.
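
The product rule (2.42) is easy to mechanize. The sketch below enumerates the $2^m$ vertices of one rectangle together with their probabilities; the function name `vertex_distribution` and the sample data are illustrative assumptions, and independence of the components is taken for granted as in the text.

```python
from itertools import product


def vertex_distribution(lower, upper, mean):
    """Vertices of the rectangle x_j [lower_j, upper_j] with the product
    probabilities of (2.42); `mean` holds the (conditional) expectations."""
    per_coord = []
    for a, b, m in zip(lower, upper, mean):
        # Two-point law of (2.32) for coordinate j
        per_coord.append(((a, (b - m) / (b - a)), (b, (m - a) / (b - a))))

    vertices = []
    for combo in product(*per_coord):
        point = tuple(value for value, _ in combo)
        prob = 1.0
        for _, weight in combo:
            prob *= weight
        vertices.append((point, prob))
    return vertices


verts = vertex_distribution(lower=[0.0, 0.0], upper=[1.0, 2.0], mean=[0.25, 1.0])
```

The resulting distribution has total mass one and reproduces the prescribed mean, so the upper bound (2.41) for this cell is obtained by averaging $Q(x,\cdot)$ over `verts`.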

Consequently, on the one hand $\bar{\psi}(\bar{x})$ is a better upper bound than
$\bar{\psi}(\hat{x})$, but on the other hand its calculation requires the solution of an
additional large-scale linear program.

Analogously to Jensen's inequality, the upper bounds (2.40) possess
the property of monotonicity: if we refine the partition (i.e. subdivide some of
its members), the new bounds will be at least as good as the previous ones (see
[14], [8]).

Figure 2.5 Upper bound in multi-dimensional case

We end this section by noting that in the absence of convexity of $Q(x,\cdot)$,
which was crucial for our previous considerations, one can still derive some error
bounds for linear two-stage problems (see [11]); the results are of a theoretical
rather than computational nature and substantiate the convergence of
approximation schemes for problems with recourse.

2.2.4 Refining Strategies


In the previous section we discussed methods for estimating errors that result
from approximating the random variable $\xi$ by a discrete one defined by a
partition $\Xi_1, \Xi_2, \dots, \Xi_L$ of the support $\Xi$ of $\xi$. Let us now consider the
question of how this partition should be refined so as to improve the accuracy
of approximation.

The simplest and most obvious technique of refining is to cut each subset
$\Xi_\ell$, $\ell = 1,2,\dots,L$, by hyperplanes orthogonal to coordinate axes in $R^m$. If
the subsets $\Xi_\ell$ are hypercubes in $R^m$, this strategy divides each $\Xi_\ell$ into
$2^m$ smaller cubes, hence after $k$ steps we shall get $L = 2^{km}$ subsets.
Consequently, the size of the approximating linear programming problem (2.21)
will increase so fast that after a small number of refining steps we shall no
longer be able to solve it.

However, a more careful analysis of our problem shows that the computa- 
tional effort can be considerably reduced by dividing only some properly chosen 
subsets along appropriate directions. 

Let us first discuss the question of selecting subsets (and the
corresponding blocks in (2.21)) that should be further divided. To this end let us
recall our results concerning error bounds and formulate them for subsets $\Xi_\ell$,
$\ell = 1,2,\dots,L$. The lower bound $\underline{\psi}_\ell(\hat{x})$ for
$\int_{\Xi_\ell} Q(\hat{x}, \xi)\, P(d\xi)$ we get from (2.26):

$$
\underline{\psi}_\ell(\hat{x}) = p_\ell\, Q(\hat{x}, \xi^\ell), \tag{2.43}
$$

while the upper bound is given by (2.41):

$$
\bar{\psi}_\ell(\hat{x}) = p_\ell \sum_{\nu=1}^{2^m} p_\ell^\nu\, Q(\hat{x}, \xi_\ell^\nu). \tag{2.44}
$$

It is now obvious that we need to divide only those blocks for which the
differences between upper and lower bounds exceed the assumed tolerance. These
differences depend on properties of the function $Q(\hat{x},\cdot)$ in $\Xi_\ell$; as mentioned
in Section 2.1.2, if $Q(\hat{x},\cdot)$ is linear in $\Xi_\ell$ then there is no approximation
error in this subset, $\bar{\psi}_\ell(\hat{x}) = \underline{\psi}_\ell(\hat{x})$, and further division of
$\Xi_\ell$ will not improve the accuracy of approximation at $\hat{x}$. On the other hand,
nonlinearity of $Q(\hat{x},\cdot)$ in $\Xi_\ell$ leads to differences between $\bar{\psi}_\ell(\hat{x})$ and
$\underline{\psi}_\ell(\hat{x})$ that indicate the necessity of dividing $\Xi_\ell$.
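
The selection rule can be sketched in one dimension as follows. The example assumes, purely for illustration, a uniform distribution on the interval spanned by `edges`, a tender $\chi = 0.3$ and recourse slopes $q^+ = 2$, $q^- = 1$; the gap between (2.44) and (2.43) is computed per cell and only cells whose gap exceeds the tolerance are marked for division.

```python
def recourse(chi, h, q_plus=2.0, q_minus=1.0):
    """Convex, piecewise linear recourse cost (illustrative data)."""
    return q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0)


def cell_gap(chi, lo, hi, total_width):
    """Upper bound (2.44) minus lower bound (2.43) for one cell [lo, hi],
    assuming a uniform distribution on an interval of length total_width."""
    p = (hi - lo) / total_width
    lower = p * recourse(chi, 0.5 * (lo + hi))                  # Jensen
    upper = p * 0.5 * (recourse(chi, lo) + recourse(chi, hi))   # Edmundson-Madansky
    return upper - lower


def cells_to_split(chi, edges, tol):
    """Indices of the cells whose bound gap exceeds the tolerance."""
    total = edges[-1] - edges[0]
    return [
        k for k in range(len(edges) - 1)
        if cell_gap(chi, edges[k], edges[k + 1], total) > tol
    ]


selected = cells_to_split(0.3, [0.0, 0.25, 0.5, 0.75, 1.0], tol=1e-9)
```

Here only the cell $[0.25, 0.5]$, which contains the kink of the recourse cost at $\chi$, is selected; in the other cells the cost is linear and both bounds coincide.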

Let us now discuss the choice of the direction along which a subset $\Xi_\ell$
should be split. Again, the efficiency of cuts in different directions is related to
the linearity of the function $Q(\hat{x}, \xi)$ with respect to the coordinates
$\xi_1, \xi_2, \dots, \xi_m$ of $\xi$. As we see from the example in Figure 2.6, no
improvement can be gained by splitting $\Xi_\ell$ with a cutting plane orthogonal to the
coordinate $\xi_1$, in which $Q$ is linear. On the other hand, if we cut $\Xi_\ell$ by a
plane orthogonal to $\xi_2$ we may obtain two subregions in which $Q(\hat{x}, \xi)$ will
be linear in $\xi$, and our next upper and lower bounds in these subsets might
become exact.







Figure 2.6 Illustration of the strategy of partitioning 


Generally, it is very difficult to divide the sets $\Xi_\ell$ into subregions in which
$Q(\hat{x},\cdot)$ is linear. Moreover, it is convenient to use cutting planes orthogonal
to coordinate axes in $R^m$, since independence of the components
$\xi_1, \xi_2, \dots, \xi_m$ in rectangular subregions is useful for calculating upper
bounds. Still, for each $\Xi_\ell$ selected to be further divided we can choose the
coordinate along which $Q(\hat{x},\cdot)$ is "mostly nonlinear".

How can we estimate the extent of the nonlinearity of $Q(\hat{x},\cdot)$ with respect
to $\xi_1, \xi_2, \dots, \xi_m$ in the subset $\Xi_\ell$? Let us observe that for calculating
the upper bound (2.44) we solve the second stage problem

$$
\begin{aligned}
\text{minimize}\quad & q^T y\\
\text{subject to}\quad & Wy = h^\nu - T^\nu \hat{x},\\
& y \ge 0,
\end{aligned}
\tag{2.45}
$$

at each vertex $\xi_\ell^\nu$, $\nu = 1,2,\dots,2^m$, of the rectangle $\Xi_\ell$. From the theory of
duality in linear programming we know that the vector of multipliers (prices)
$\pi^\nu$ corresponding to the constraints in (2.45) is a measure of sensitivity of $Q$
with respect to the right-hand side $h - T\hat{x}$ at $\xi^\nu = (h^\nu, T^\nu)$. If the
multipliers are the same at each vertex then $Q(\hat{x},\cdot)$ is linear in $\Xi_\ell$; otherwise
we can select a direction in which the multipliers change most rapidly. One can
use here various methods for comparing differences between multiplier vectors,
giving rise to many particular strategies of refining, but the basic idea will
always be to avoid inefficient cuts.

After dividing some of the subsets $\Xi_\ell$ we shall have to solve the approximate
problem (2.21) again, with a larger number of blocks. This will give us a new
point $\hat{x}$, at which we shall have to repeat our analysis of upper and lower bounds
and again select subsets to be divided and directions of cuts.


2.2.5 The case of simple recourse with random right-hand sides 


In this section we improve and simplify our previous results concerning error
bounds and refining strategies in a special case of linear two-stage problems of
stochastic programming, the so-called problems with simple recourse. The main
feature of these problems is that the matrix $W$ in (2.12) is of the form

$$
W = [I, -I],
$$

where $I$ denotes the identity matrix in $R^{m_2}$. We shall also assume that the cost
vector $q$ and the matrix $T$ in (2.12) are deterministic, and only the right-hand
side $h(\omega)$ is random.

After substituting $Tx = \chi$ and $y = [y^+, y^-]$, $q = [q^+, q^-]$ we can rewrite
the second stage problem (2.12) as follows:

$$
\begin{aligned}
\text{minimize}\quad & (q^+)^T y^+ + (q^-)^T y^-\\
\text{subject to}\quad & y^+ - y^- = h(\omega) - \chi,\\
& y^+ \ge 0,\; y^- \ge 0.
\end{aligned}
\tag{2.46}
$$

We shall denote the optimal value of this problem by $Q(\chi, h)$.

Owing to the special form of the constraints in (2.46), we can now split it into
$m_2$ independent linear problems with only one constraint:

$$
\begin{aligned}
\text{minimize}\quad & q_j^+ y_j^+ + q_j^- y_j^-\\
\text{subject to}\quad & y_j^+ - y_j^- = h_j(\omega) - \chi_j,\\
& y_j^+ \ge 0,\; y_j^- \ge 0,
\end{aligned}
\tag{2.47}
$$

$j = 1,2,\dots,m_2$. If we denote the optimal objective values in subproblems
(2.47) by $Q_j(\chi_j, h_j)$, $j = 1,2,\dots,m_2$, we may write

$$
Q(\chi, h) = \sum_{j=1}^{m_2} Q_j(\chi_j, h_j). \tag{2.48}
$$

It is the above separable structure of the two-stage problem (2.46) that
substantially simplifies error bounds and refining strategies.

Before we pass on to this matter, let us briefly discuss conditions of
solvability of the second stage problem. Observe that if $q_j^+ + q_j^- < 0$ for some
$j$, then (2.47) has an unbounded solution: $y_j^- = t$, $y_j^+ = t + h_j(\omega) - \chi_j$,
$t \to +\infty$, for which $Q_j = t(q_j^+ + q_j^-) + q_j^+(h_j(\omega) - \chi_j) \to -\infty$ as
$t \to +\infty$. Conversely, for $q_j^+ + q_j^- \ge 0$ problem (2.47) has an optimal
solution defined as follows:

$$
\begin{aligned}
&\text{if } h_j(\omega) - \chi_j \ge 0 \text{ then } y_j^+ = h_j(\omega) - \chi_j,\; y_j^- = 0;\\
&\text{if } h_j(\omega) - \chi_j < 0 \text{ then } y_j^+ = 0,\; y_j^- = -h_j(\omega) + \chi_j.
\end{aligned}
\tag{2.49}
$$

Therefore the condition $q^+ + q^- \ge 0$ is necessary and sufficient for solvability of
the second stage problem (2.46) at any $h(\omega)$ and any $\chi = Tx$ (cf. [12]). From
now on we shall assume that this condition is satisfied.

The first important observation concerning our problem is that the expected
value $E\,Q(\chi, h(\omega))$ of the recourse function can be calculated exactly at
any $\chi = Tx$. Indeed, by (2.48)

$$
E\,Q(\chi, h(\omega)) = \sum_{j=1}^{m_2} E\,Q_j(\chi_j, h_j(\omega)), \tag{2.50}
$$

where, according to (2.49), each $Q_j$ is of the form

$$
Q_j(\chi_j, h_j(\omega)) =
\begin{cases}
q_j^+ (h_j(\omega) - \chi_j) & \text{if } h_j(\omega) \ge \chi_j,\\
q_j^- (\chi_j - h_j(\omega)) & \text{if } h_j(\omega) < \chi_j.
\end{cases}
$$

The dependence of $Q_j(\chi_j, h_j)$ on $h_j$ is illustrated in Figure 2.7. By the linearity
of this function in the regions $\{h_j(\omega) \ge \chi_j\}$ and $\{h_j(\omega) < \chi_j\}$ we obtain

$$
E\,Q_j(\chi_j, h_j(\omega)) = q_j^+ \big(h_j^+(\chi_j) - \chi_j\big)\, p_j^+(\chi_j)
+ q_j^- \big(\chi_j - h_j^-(\chi_j)\big)\, p_j^-(\chi_j), \tag{2.51}
$$

where

$$
\begin{aligned}
h_j^+(\chi_j) &= E\{h_j(\omega) \mid h_j(\omega) \ge \chi_j\},\\
h_j^-(\chi_j) &= E\{h_j(\omega) \mid h_j(\omega) < \chi_j\}
\end{aligned}
$$

are conditional expectations of $h_j$ in the areas of linearity of $Q_j(\chi_j,\cdot)$, and

$$
\begin{aligned}
p_j^+(\chi_j) &= P\{h_j(\omega) \ge \chi_j\},\\
p_j^-(\chi_j) &= P\{h_j(\omega) < \chi_j\}
\end{aligned}
$$

are the corresponding probabilities. The function $E\,Q_j(\chi_j, h_j(\omega))$ is illustrated
in Figure 2.8, where $[a_j, b_j]$ denotes the support of $h_j(\omega)$. We see that if
$\chi_j < a_j$, then $p_j^+(\chi_j) = 1$, $p_j^-(\chi_j) = 0$ and the function is linear in $\chi_j$
with the slope $-q_j^+$. An analogous situation occurs for $\chi_j > b_j$, where the slope
is equal to $q_j^-$. Within the support of $h_j(\omega)$ the function is convex and its
minimum depends on $q_j^+$, $q_j^-$ and of course on the distribution of $h_j(\omega)$.

From (2.50) and (2.51) we finally get

$$
E\,Q(\chi, h(\omega)) = \sum_{j=1}^{m_2} \Big[ q_j^+ \big(h_j^+(\chi_j) - \chi_j\big)\, p_j^+(\chi_j)
+ q_j^- \big(\chi_j - h_j^-(\chi_j)\big)\, p_j^-(\chi_j) \Big]. \tag{2.52}
$$

Practical application of formula (2.52) is relatively easy, since it requires only
one-dimensional integration for calculating the quantities $p_j^+$, $p_j^-$, $h_j^+$ and
$h_j^-$ at a given $\chi_j$, contrary to the general two-stage problem, where
multidimensional integrals would have to be evaluated.

Figure 2.7 The j-th component of the recourse cost in a problem with simple
recourse

Figure 2.8 The j-th component of the expected recourse cost and its
approximation
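
As an illustration of (2.51), consider one component $h_j$ that is uniformly distributed on $[a, b]$; this distribution is an assumption made for the example only. The conditional means and probabilities are then available in closed form, and the result can be compared with a crude numerical integration of the recourse cost.

```python
def expected_simple_recourse_uniform(chi, a, b, q_plus, q_minus):
    """Formula (2.51) for h uniform on [a, b] (assumed distribution)."""
    if chi <= a:                      # p+ = 1: linear with slope -q_plus
        return q_plus * (0.5 * (a + b) - chi)
    if chi >= b:                      # p- = 1: linear with slope q_minus
        return q_minus * (chi - 0.5 * (a + b))
    p_plus = (b - chi) / (b - a)      # P{h >= chi}
    p_minus = (chi - a) / (b - a)     # P{h < chi}
    h_plus = 0.5 * (chi + b)          # E{h | h >= chi}
    h_minus = 0.5 * (a + chi)         # E{h | h < chi}
    return (q_plus * (h_plus - chi) * p_plus
            + q_minus * (chi - h_minus) * p_minus)


def midpoint_check(chi, a, b, q_plus, q_minus, n=10000):
    """Midpoint-rule approximation of the same expectation, for comparison."""
    width = (b - a) / n
    total = 0.0
    for k in range(n):
        h = a + (k + 0.5) * width
        total += q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0)
    return total / n


exact = expected_simple_recourse_uniform(0.3, 0.0, 1.0, 2.0, 1.0)
approx = midpoint_check(0.3, 0.0, 1.0, 2.0, 1.0)
```

For these data the closed form gives $2 \cdot 0.35 \cdot 0.7 + 1 \cdot 0.15 \cdot 0.3 = 0.535$, and the numerical check agrees up to the discretization error.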

Since we can exactly evaluate the objective $\psi(x) = c^T x + E\,Q(Tx, h(\omega))$
at any $x$, we no longer need upper and lower bounds for this value. One
may ask here whether we need approximation methods at all, if the objective
values can be computed easily. There is no general answer to this question, but
approximation schemes may still prove useful, since the approximating problems
are linear, while the original one is nonlinear in general, as we see from Figure
2.8. But if we use approximation methods we shall still need lower bounds for
the minimum value of the objective $\psi(x)$ and appropriate refining strategies.

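Similarly to the way of evaluating the objective, both these operations,
calculation of lower bounds and refining of the partition, can be carried out
separately for each coordinate of $h(\omega)$. Let the coordinates $h_j(\omega)$ be distributed
in intervals $[a_j, b_j]$, $j = 1,2,\dots,m_2$, so that the support of $h(\omega)$ is contained
in the hyper-rectangle $\Xi = \times_{j=1}^{m_2} [a_j, b_j]$. In an analogous way to the case
of complete recourse, we solve at first the approximating linear programming
problem (2.21) with only one block $\Xi_1 = \Xi$:

$$
\begin{aligned}
\text{minimize}\quad & c^T x + (q^+)^T y^+ + (q^-)^T y^-\\
\text{subject to}\quad & Ax = b,\\
& Tx + I y^+ - I y^- = h^1,\\
& x \ge 0,\; y^+ \ge 0,\; y^- \ge 0,
\end{aligned}
\tag{2.53}
$$

where $h^1 = E\,h(\omega)$. Let $(\hat{x}, \hat{y}^+, \hat{y}^-)$ be a solution to this problem. Then
obviously each pair $(\hat{y}_j^+, \hat{y}_j^-)$, $j = 1,2,\dots,m_2$, is a solution to the $j$-th
piece of the second stage problem:

$$
\begin{aligned}
\text{minimize}\quad & q_j^+ y_j^+ + q_j^- y_j^-\\
\text{subject to}\quad & y_j^+ - y_j^- = h_j^1 - \hat{\chi}_j,\\
& y_j^+ \ge 0,\; y_j^- \ge 0,
\end{aligned}
\tag{2.54}
$$

with $\hat{\chi}_j$ being the $j$-th coordinate of $T\hat{x}$ (cf. (2.47)). Since $h_j^1 = E\,h_j(\omega)$
and the function $Q_j(\hat{\chi}_j,\cdot)$ is convex (see Figure 2.7), from Jensen's inequality
(cf. Section 2.2.3) we obtain

$$
Q_j(\hat{\chi}_j, h_j^1) \le E\,Q_j(\hat{\chi}_j, h_j(\omega)), \qquad j = 1,2,\dots,m_2, \tag{2.55}
$$

and

$$
c^T \hat{x} + \sum_{j=1}^{m_2} Q_j(\hat{\chi}_j, h_j^1)
\le \min_{Ax = b,\; x \ge 0} \big[ c^T x + E\,Q(Tx, h(\omega)) \big]
\le c^T \hat{x} + \sum_{j=1}^{m_2} E\,Q_j(\hat{\chi}_j, h_j(\omega)). \tag{2.56}
$$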
The left side of (2.56) we obtain from the approximating problem (2.53), while
the right side represents the objective value at $\hat{x}$ and can be calculated by (2.52).

If we had equalities in (2.55) for all $j$, the point $\hat{x}$ would minimize our
original objective function, as follows from (2.56). If this is not the case, the
differences between both sides of (2.55) are measures of accuracy of our
approximation with respect to each coordinate $h_j(\omega)$ of $h(\omega)$, $j = 1,2,\dots,m_2$.
Hence, we can select the coordinates for which the accuracy is not sufficient
and split the corresponding intervals $[a_j, b_j]$. It follows from Figure 2.7 that it
is most efficient to divide them at $\hat{\chi}_j$, since the function $Q_j(\hat{\chi}_j,\cdot)$ will be
linear in the resulting subintervals $[a_j, \hat{\chi}_j]$ and $[\hat{\chi}_j, b_j]$. Obviously,
$\hat{\chi}_j \in [a_j, b_j]$ for the selected coordinates, since otherwise we would have either
$h_j^+(\hat{\chi}_j) = h_j^1$ or $h_j^-(\hat{\chi}_j) = h_j^1$ and an equality in (2.55) (see Figure 2.8).

The partition of the intervals $[a_j, b_j]$ defines a new partition of the rectangle
$\Xi$ into subregions $\Xi_1, \Xi_2, \dots, \Xi_L$. With this partition we solve (2.21) again
and obtain a new point $\hat{x}$ for which the analysis of accuracy can also be carried
out component-wise. Indeed, in each subinterval $[a_j^k, b_j^k)$ of $[a_j, b_j]$ we have
Jensen's inequality similar to (2.55),

$$
Q_j(\hat{\chi}_j, h_j^k) \le E\{ Q_j(\hat{\chi}_j, h_j(\omega)) \mid h_j(\omega) \in [a_j^k, b_j^k) \}, \tag{2.57}
$$

where $h_j^k$ is the conditional expectation in $[a_j^k, b_j^k)$,

$$
h_j^k = E\{ h_j(\omega) \mid h_j(\omega) \in [a_j^k, b_j^k) \}. \tag{2.58}
$$

In a similar way to (2.52) we can also calculate the exact value of the conditional
expectation $E\{ Q_j(\hat{\chi}_j, h_j(\omega)) \mid h_j(\omega) \in [a_j^k, b_j^k) \}$. Again, if
$\hat{\chi}_j \notin [a_j^k, b_j^k)$ then $Q_j(\hat{\chi}_j,\cdot)$ is linear in $[a_j^k, b_j^k)$ and (2.57)
becomes an equality. Therefore we divide only those subintervals for which
$\hat{\chi}_j \in [a_j^k, b_j^k)$ and the corresponding accuracy in (2.57) is not sufficient
(i.e. at most one subinterval for each component).

This strategy of refining is illustrated in Figure 2.9.
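
The effect of splitting an interval at the tender can be verified on a small example. The sketch assumes one component $h_j$ uniformly distributed on $[0,1]$, a fixed $\hat{\chi}_j = 0.3$ and slopes $q_j^+ = 2$, $q_j^- = 1$; the helper names are illustrative. Note that in the actual procedure $\hat{x}$ changes after each solution of (2.21), whereas here the tender is kept fixed to isolate the effect of the cut.

```python
def recourse(chi, h, q_plus, q_minus):
    """One component Q_j(chi_j, h_j) of the simple-recourse cost."""
    return q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0)


def jensen_on_partition(chi, edges, q_plus, q_minus):
    """Jensen lower bound for h uniform on [edges[0], edges[-1]]."""
    a, b = edges[0], edges[-1]
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        p = (hi - lo) / (b - a)              # probability of the cell
        cond_mean = 0.5 * (lo + hi)          # conditional expectation
        total += p * recourse(chi, cond_mean, q_plus, q_minus)
    return total


def refine_at(chi, edges):
    """Add chi as a breakpoint if it lies strictly inside some cell."""
    if any(lo < chi < hi for lo, hi in zip(edges[:-1], edges[1:])):
        return sorted(edges + [chi])
    return list(edges)


coarse = jensen_on_partition(0.3, [0.0, 1.0], 2.0, 1.0)
fine = jensen_on_partition(0.3, refine_at(0.3, [0.0, 1.0]), 2.0, 1.0)
```

With a single cell the bound is 0.4, while after the cut at 0.3 the recourse cost is linear in both subintervals and the bound equals the exact expectation 0.535.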

Figure 2.9 The strategy of partitioning in problems with simple recourse


2.3 Chance Constrained Programming 

Another way of formulating optimization models for problems which involve
random parameters is the use of chance constraints. If in our linear model with
objective $c^T x$ and constraints $Tx \ge h$, $x \ge 0$ some entries of the matrix $T$ or
the right-hand side $h$ are random, we can formulate the corresponding
optimization problem as follows:

$$
\begin{aligned}
\text{minimize}\quad & c^T x\\
\text{subject to}\quad & P\{T(\omega)x \ge h(\omega)\} \ge \alpha,\\
& x \ge 0,
\end{aligned}
\tag{2.59}
$$

where $0 < \alpha < 1$ is a prescribed reliability level. Problem (2.59) is called
the stochastic programming problem with joint chance constraints. Another
possibility of formulating such constraints is to impose reliability levels for each
row of the relation $T(\omega)x \ge h(\omega)$ (so-called disjoint chance constraints), which
yields the problem

$$
\begin{aligned}
\text{minimize}\quad & c^T x\\
\text{subject to}\quad & P\{T_j(\omega)x \ge h_j(\omega)\} \ge \alpha_j, \quad j = 1,2,\dots,m,\\
& x \ge 0,
\end{aligned}
\tag{2.60}
$$

with $0 < \alpha_j < 1$, and $T_j(\omega)$ indicating the $j$-th row of $T(\omega)$. Problems (2.59)
and (2.60) can be regarded as natural generalizations of common linear
programming problems to the case of random constraint coefficients.
These problems, however, are no longer linear, since the constraint functions

$$
g(x) = P\{T(\omega)x \ge h(\omega)\} \tag{2.61}
$$

and

$$
g_j(x) = P\{T_j(\omega)x \ge h_j(\omega)\}, \qquad j = 1,2,\dots,m, \tag{2.62}
$$

are in general nonlinear. Moreover, it may turn out that these functions are
not concave and the feasible sets

$$
X_1(\alpha) = \{x \ge 0 : g(x) \ge \alpha\}, \tag{2.63}
$$

$$
X_2(\alpha_1, \alpha_2, \dots, \alpha_m) = \bigcap_{j=1}^{m} X_{2j}(\alpha_j), \qquad
X_{2j}(\alpha_j) = \{x \ge 0 : g_j(x) \ge \alpha_j\}, \tag{2.64}
$$

may be nonconvex and even disconnected. An extensive discussion of convexity
properties of chance-constrained programs can be found in [13], [24] and [85].
Below we summarize only the simplest results.

If only the right-hand side $h(\omega)$ is random, then the sets $X_{2j}(\alpha_j)$ are
convex for all $0 < \alpha_j < 1$, $j = 1,2,\dots,m$. Convexity properties of the set
$X_1(\alpha)$ depend, however, on the distribution of $h(\omega)$. From the general theory
of so-called logarithmic concave and quasi-concave probability measures (cf.
[22], [24], [4] and [27]) it follows that for a normal distribution of $h(\omega)$ the set
$X_1(\alpha)$ is convex and closed for each $0 < \alpha < 1$.

When the technology matrix $T$ is also random, no general convexity
statements are available up to now. We know that the sets $X_{2j}(\alpha_j)$ are convex
for normally distributed $T_j(\omega)$ and $h_j(\omega)$, under the condition that
$\tfrac{1}{2} \le \alpha_j < 1$. Special conditions have also been found for some other
particular distributions (see [17], (as)).

Let us now discuss possible approaches to solving chance-constrained prob- 
lems of the type (2.59) and (2.60). The most straightforward one is to use non- 
linear programming techniques for constrained optimization. These techniques, 
however, require calculation of constraint functions (2.61) or (2.62) and their 
gradients (if they exist) at successively generated points, which in general is 
a rather difficult task involving multidimensional integration. Still, with only 
the right-hand side h(w) random and for some special classes of distributions, 
application of fast simulation techniques (cf. [5]) makes this approach effective, 
as practical examples of [25] and [26] show. 

We may also try to approximate (2.59) or (2.60) by another optimization 
problem which would be easier to solve. 

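One approach is to approximate the random variable $\xi(\omega) = (h(\omega), T(\omega))$
by a discretely distributed one. If $\tilde{\xi}$ is such an approximation, with

$$
\begin{aligned}
P\{\tilde{\xi} = (h^1, T^1)\} &= p_1 > 0,\\
P\{\tilde{\xi} = (h^2, T^2)\} &= p_2 > 0,\\
&\;\;\vdots\\
P\{\tilde{\xi} = (h^L, T^L)\} &= p_L > 0,
\end{aligned}
\tag{2.65}
$$

$$
\sum_{\ell=1}^{L} p_\ell = 1,
$$

then the problem that approximates (2.59) takes on the form

$$
\begin{aligned}
\text{minimize}\quad & c^T x\\
\text{subject to}\quad & \tilde{g}(x) = \sum_{\ell=1}^{L} p_\ell\, v_\ell(x) \ge \alpha,\\
& x \ge 0,
\end{aligned}
\tag{2.66}
$$

where for $\ell = 1,2,\dots,L$

$$
v_\ell(x) =
\begin{cases}
1 & \text{if } T^\ell x \ge h^\ell,\\
0 & \text{otherwise}.
\end{cases}
\tag{2.67}
$$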

Interesting results concerning the convergence of the feasible set of (2.66) to
the feasible set of (2.59) as the accuracy of discretization increases have been
obtained in [29]. So far we do not know much about the practical efficiency
of this approach. It may, however, be limited by the fact that the functions
(2.67) are discontinuous and the feasible set of (2.66) may be nonconvex and
disconnected, even if the feasible region of (2.59) is convex.
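
The discontinuity of the approximate reliability function in (2.66) is visible already in a tiny example. The following sketch evaluates $\sum_\ell p_\ell v_\ell(x)$ for a list of scenarios; the function name `reliability` and the three scalar scenarios are assumptions chosen for illustration.

```python
def reliability(x, scenarios):
    """Sum of p_l * v_l(x) as in (2.66)-(2.67).

    scenarios is a list of tuples (p, T, h); v_l(x) = 1 when every row of
    T^l x >= h^l holds and 0 otherwise.
    """
    total = 0.0
    for p, T, h in scenarios:
        satisfied = all(
            sum(t * x_i for t, x_i in zip(row, x)) >= h_j
            for row, h_j in zip(T, h)
        )
        if satisfied:
            total += p
    return total


# One constraint x >= h with h in {1, 2, 3}
scenarios = [
    (0.5, [[1.0]], [1.0]),
    (0.3, [[1.0]], [2.0]),
    (0.2, [[1.0]], [3.0]),
]
levels = [reliability([x], scenarios) for x in (0.5, 1.0, 2.0, 2.9, 3.0)]
```

The values in `levels` jump from 0 to 0.5, 0.8 and 1.0 exactly at the scenario right-hand sides and are constant in between, so gradient information is of no use for this constraint.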

Another possibility is to replace (2.59) by a two-stage problem

$$
\text{minimize}\quad c^T x + E\,Q(x, \xi(\omega)), \tag{2.68}
$$

where $Q(x, \xi(\omega))$ is the minimum objective value in the second stage problem

$$
\begin{aligned}
\text{minimize}\quad & q^T y\\
\text{subject to}\quad & Wy \ge h(\omega) - T(\omega)x,\\
& y \ge 0,
\end{aligned}
\tag{2.69}
$$

with some $q \in R^m$, $q \ge 0$, and a certain recourse matrix $W$. The simplest
choice of these parameters would be $q_j = M$, $j = 1,2,\dots,m$, with some large
$M > 0$ and $W = I$ (simple recourse). The idea of this approximation is to
introduce the penalty $q^T y$ for not satisfying the constraints $T(\omega)x \ge h(\omega)$. We
can see it directly in the simple recourse case: the solution $y(x, \omega)$ to (2.69) is
given by

$$
y_j(x, \omega) = \max\big(0,\; h_j(\omega) - T_j(\omega)x\big), \qquad j = 1,2,\dots,m, \tag{2.70}
$$

and $E\{q^T y(x, \omega)\}$ is an average cost of violating the constraint $T(\omega)x \ge h(\omega)$.

Problems (2.68)-(2.69) and (2.59) are not equivalent, but under reasonable
assumptions one can prove that the probability of satisfying $T(\omega)\hat{x} \ge h(\omega)$ at
the solution $\hat{x}$ to (2.68)-(2.69) tends to 1 as $q_j \to +\infty$, $j = 1,2,\dots,m$. Of
course, in practice we shall have to experiment with values of $q$ so as to achieve
the required level of probability of satisfying the chance constraints with
reasonable values of $c^T \hat{x}$.
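
For a discrete distribution the penalty term of (2.68) with $W = I$ and $q_j = M$ can be evaluated directly from (2.70). The sketch below uses hypothetical scenario data and a hypothetical helper name; it only illustrates how the average violation cost reacts to $x$ and $M$.

```python
def expected_penalty(x, scenarios, M):
    """E{q^T y(x, w)} for W = I and q_j = M, using (2.70).

    scenarios is a list of tuples (p, T, h).
    """
    total = 0.0
    for p, T, h in scenarios:
        for row, h_j in zip(T, h):
            lhs = sum(t * x_i for t, x_i in zip(row, x))
            shortfall = max(0.0, h_j - lhs)   # y_j(x, w) of (2.70)
            total += p * M * shortfall
    return total


# One constraint x >= h with h in {1, 2, 3}
scenarios = [
    (0.5, [[1.0]], [1.0]),
    (0.3, [[1.0]], [2.0]),
    (0.2, [[1.0]], [3.0]),
]
cost_at_2 = expected_penalty([2.0], scenarios, M=10.0)
cost_at_3 = expected_penalty([3.0], scenarios, M=10.0)
```

At $x = 2$ only the third scenario is violated, by one unit, so the penalty is $0.2 \cdot 10 \cdot 1 = 2$; at $x = 3$ it vanishes. Unlike the reliability function of (2.66), this penalty is continuous and convex in $x$.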




2.4 Game-theoretic Models and Worst-case Approximations 
So far we have analyzed stochastic programming models in which the distributions
of random parameters were known, and our main concern was to find efficient
solution techniques. However, in practice we often encounter stochastic
problems in which statistical properties of some parameters are known only to a
certain extent, e.g. only their supports and expected values are available. In
such situations a fundamental question arises, whether it is possible to properly
define a concept of a solution and to develop methods for finding such solutions.
We shall show that a special approximation of the original problem, the so-called
worst-case approximation, may help us to answer these questions.

To formulate the problem under consideration more precisely, let us
assume that the objective function of our optimization problem is defined as a
mathematical expectation

$$
F(x) = E f(x, \xi) = \int_{\Xi_*} f(x, \xi)\, P_*(d\xi), \tag{2.71}
$$

where $x \in R^n$ denotes the decision vector, $\Xi_* \subset R^m$ is the support of the
vector of random parameters $\xi$, $P_*$ is the probability measure on $\Xi_*$ describing
the distribution of $\xi$, and $f : R^n \times R^m \to R^1$. Next, suppose that the
distribution of $\xi$ is not known exactly; we know only a certain outer
approximation $\Xi \subset R^m$ of the support $\Xi_*$,

$$
\Xi_* \subset \Xi, \tag{2.72}
$$

and expectations of some functions $g_1, g_2, \dots, g_k$,

$$
E g_i(\xi) = \int_{\Xi_*} g_i(\xi)\, P_*(d\xi) = \mu_i, \qquad i = 1,2,\dots,k. \tag{2.73}
$$

In particular, equations (2.73) may represent our knowledge about the moments
of $\xi$: setting, for instance, $k = m$ and $g_j(\xi) = \xi_j$, $j = 1,2,\dots,m$, we obtain
from (2.73) conditions on the expected value of $\xi$, $E\xi_j = \mu_j$, $j = 1,2,\dots,m$.

Since the distribution of $\xi$ is not known, we are not able to calculate or
approximate with reasonable accuracy the value of the objective $F(x)$, and
thus looking for a vector $x$ that minimizes (2.71) is out of the question. We have
to reformulate the problem in such a way that the new formulation will involve
only information that is really available. A game-theoretic approach initiated
in the area of stochastic programming in [10], [86] provides a way to overcome
this difficulty.

Let $\mathcal{P}$ be the class of probability measures $P$ on $R^m$ satisfying the following
conditions:

$$
P(\Xi) = 1, \tag{2.74}
$$

$$
\int g_i(\xi)\, P(d\xi) = \mu_i, \qquad i = 1,2,\dots,k. \tag{2.75}
$$

It follows from (2.72) and (2.73) that the "true" measure $P_*$ belongs to
$\mathcal{P}$; on the other hand, the measures $P \in \mathcal{P}$ cannot be distinguished on
the basis of the information available to us. Therefore it seems reasonable to
assume the worst case and consider the function

$$
\bar{F}(x) = \sup_{P \in \mathcal{P}} \int f(x, \xi)\, P(d\xi). \tag{2.76}
$$

Obviously, for each $x$ we have $\bar{F}(x) \ge F(x)$, hence after minimizing $\bar{F}$ with
respect to $x$ the value of the "real" objective will be at least as good as the
value of $\bar{F}$.

The definition of F involves only information that is available, but it re- 
quires the operation of maximization with respect to probability distributions, 
which in general is extremely difficult and unsuitable for practical calculations. 
Still, it turns out that in many important cases we are able to carry out this 
operation analytically, and the distribution at which the maxinmm of the inte- 
gral in (2.76) is attained does not depend on z and has a special and easy to 
handle form. 

Let us assume that $\Xi$ is a convex, closed and bounded polyhedron, and let $\xi^\nu$, $\nu = 1, 2, \ldots, N$, denote its vertices. Furthermore, let the functions $g_i$, $i = 1, 2, \ldots, k$, in (2.75) be linear and the function $f(z, \xi)$ be convex in $\xi$ for each $z$. Then one can prove that the supremum in (2.76) is attained at a measure $\bar P$ (generally dependent on $z$) concentrated at the vertices $\xi^\nu$:

$$\bar P(\{\xi^\nu\}) = p_\nu, \qquad \nu = 1, 2, \ldots, N, \tag{2.77}$$

$$p_\nu \ge 0, \qquad \nu = 1, 2, \ldots, N, \tag{2.78}$$

$$\sum_{\nu=1}^{N} p_\nu = 1. \tag{2.79}$$


The probabilities $p_\nu$ associated with the vertices of $\Xi$ should satisfy, besides (2.78) and (2.79), the following equations that result from (2.75):

$$\sum_{\nu=1}^{N} p_\nu\, g_i(\xi^\nu) = \mu_i, \qquad i = 1, 2, \ldots, k. \tag{2.80}$$

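Hence, $\bar F(z)$ is the optimal value of the linear programming problem

$$\begin{array}{ll} \underset{\{p_\nu\}}{\text{maximize}} & \displaystyle\sum_{\nu=1}^{N} p_\nu\, f(z, \xi^\nu) \\[2mm] \text{subject to} & \text{(2.78), (2.79) and (2.80).} \end{array} \tag{2.81}$$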


Obviously, problem (2.81) is much easier to solve than (2.76) and makes the concept of worst-case approximations implementable.

In some cases the task of calculating the upper bound $\bar F(z)$ may be simplified even further, because it may turn out that the feasible set of (2.81), defined by (2.78)-(2.80), contains exactly one point. To illustrate this possibility, suppose that $\xi$ is a scalar random variable, $\Xi = [a, b]$, and the additional conditions (2.73) comprise only one equation regarding the expected value: $E\xi = \mu$. Then (2.78)-(2.80) reduce to $p_1 + p_2 = 1$ and $a p_1 + b p_2 = \mu$ with $p_1, p_2 \ge 0$, which uniquely determine the probabilities describing the extremal distribution $\bar P$:

$$\bar P(\{a\}) = p_1 = \frac{b - \mu}{b - a}, \qquad \bar P(\{b\}) = p_2 = \frac{\mu - a}{b - a}. \tag{2.82}$$

These probabilities do not depend on $z$, hence our worst-case approximation reduces to replacing the random variable $\xi$ in (2.71) by the random variable $\tilde\xi$, attaining the values $a$ and $b$ with probabilities $p_1$ and $p_2$.

It is interesting to observe that the distribution of $\tilde\xi$ defined in this way is identical with that used for the Edmundson-Madansky inequality (cf. (2.42), (2.43)), which is quite natural, because essentially we consider the same problem of finding an upper bound for the integral (2.71).
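The scalar case can be illustrated with a short sketch. It assumes a convex integrand in the random variable, a support interval [a, b] and a known mean mu; the function names and the sample integrand are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def worst_case_two_point(a, b, mu):
    """Probabilities of the extremal two-point distribution on {a, b}
    with prescribed mean mu, as in (2.82)."""
    if not (a < b and a <= mu <= b):
        raise ValueError("need a < b and a <= mu <= b")
    p1 = Fraction(b - mu) / Fraction(b - a)   # mass placed at a
    p2 = Fraction(mu - a) / Fraction(b - a)   # mass placed at b
    return p1, p2

def worst_case_bound(f, a, b, mu):
    """Upper bound on E f(xi) for convex f, given only the support and the mean."""
    p1, p2 = worst_case_two_point(a, b, mu)
    return p1 * f(a) + p2 * f(b)

# Hypothetical convex integrand for a fixed decision: f(xi) = (xi - 1)^2.
f = lambda xi: (xi - 1) ** 2
p1, p2 = worst_case_two_point(0, 4, 1)
bound = worst_case_bound(f, 0, 4, 1)

# One admissible distribution: uniform on {0, 1, 2}, which also has mean 1.
expectation = Fraction(sum(f(v) for v in (0, 1, 2)), 3)
```

For this data the extremal distribution puts mass 3/4 at 0 and 1/4 at 4, and the bound dominates the expectation under the sample distribution, as the inequality requires.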

The above observations can be easily extended to the multidimensional case, provided that $\Xi$ is a hyper-rectangle $\times_{j=1}^{m} [a_j, b_j]$, $f(z, \xi)$ is separable with respect to the coordinates $\xi_j$, i.e.

$$f(z, \xi) = \sum_{j=1}^{m} f_j(z, \xi_j), \tag{2.83}$$

and conditions (2.73) are of the form

$$E\xi_j = \mu_j, \qquad j = 1, 2, \ldots, m. \tag{2.84}$$


Under these assumptions, the worst-case approximation to (2.71) can be obtained by replacing $\xi$ with a discrete random vector $\tilde\xi$ having coordinates $\tilde\xi_j$, $j = 1, 2, \ldots, m$, distributed similarly to (2.82):

$$P\{\tilde\xi_j = a_j\} = \frac{b_j - \mu_j}{b_j - a_j}, \qquad P\{\tilde\xi_j = b_j\} = \frac{\mu_j - a_j}{b_j - a_j}. \tag{2.85}$$


The above result can be directly applied to two-stage problems with simple recourse (cf. Section 2.2.5), since objective functions of these problems possess the required property of separability with respect to the coordinates of the random vector $h$, see (2.48). One can further extend this result to some problems with a nonseparable objective $f(z, \xi)$. Namely, assuming that we know in advance that the coordinates $\xi_j$, $j = 1, 2, \ldots, m$, are independent random variables, we can restrict the class of measures considered to those probability measures on $R^m$ that satisfy (2.74), (2.75) and can be expressed as products of measures with respect to the coordinates. Under this assumption, for a hyper-rectangle $\Xi$ the worst-case distribution does not depend on $z$ and is defined again by (2.85).

Interesting extensions and generalizations of the idea of worst-case approximations in stochastic programming can be found in [6] and [7].

CHAPTER 3
LARGE SCALE LINEAR PROGRAMMING TECHNIQUES 
R.J-B. Wets 


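We study the use of large scale linear programming techniques for solving (linear) recourse problems¹ whose random elements have discrete distributions (with finite support), more precisely for problems of the type:

$$\begin{array}{ll} \text{find} & x \in R^{n_1}_+ \\ \text{such that} & Ax = b \\ \text{and} & z = cx + \mathcal{Q}(x) \text{ is minimized} \end{array} \tag{3.1}$$

where

$$\mathcal{Q}(x) = \sum_{\ell=1}^{L} p_\ell\, Q(x, \xi^\ell) = E\{Q(x, \xi(\omega))\} \tag{3.2}$$

and for each $\ell = 1, \ldots, L$, the recourse cost $Q(x, \xi^\ell)$ is obtained by solving the recourse problem:

$$Q(x, \xi^\ell) = \inf\{\, q^\ell y \mid Wy = h^\ell - T^\ell x,\ y \in R^{n_2}_+ \,\} \tag{3.3}$$

where

$$\xi^\ell = (q^\ell, h^\ell, T^\ell) = (q^\ell_1, \ldots, q^\ell_{n_2}, h^\ell_1, \ldots, h^\ell_{m_2}, t^\ell_{11}, \ldots, t^\ell_{1 n_1}, \ldots, t^\ell_{m_2 n_1})$$

($\xi^\ell \in R^N$ with $N = n_2 + m_2 + m_2 \cdot n_1$)

and

$$p_\ell = \text{Prob}\,[\xi(\omega) = \xi^\ell].$$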


¹ The potential use of large scale programming techniques for solving stochastic programs with chance-constraints appears to be less promising and has not yet been investigated. The approximation scheme for chance-constraints proposed by Salinetti, 1983, would, if implemented, require a detailed analysis of the structural properties of the resulting (large-scale) linear programs. Much of the analysis laid out in this Section would also be applicable to that case, but it appears that further properties, namely the connections between the upper and lower bounding problems, should be exploited.

The sizes of the matrices are consistent with $x \in R^{n_1}$, $y \in R^{n_2}$, $b \in R^{m_1}$ and, for all $\ell$, $h^\ell \in R^{m_2}$; for a more detailed description of the recourse model consult Part I of this Volume. Because $W$ is nonstochastic we refer to this problem as a model with fixed recourse. The ensuing development is aimed at dealing with problems that exhibit no further structural properties. Problems with simple recourse for example, i.e. when $W = (I, -I)$, are best dealt with in a nonlinear programming framework, cf. Chapter 4.

Before we embark on the description of solution strategies for the problem at hand, it is useful to review some of the ways in which a problem of this type might arise in practice. First, the problem is indeed a linear recourse model whose random elements follow a known discrete distribution function. In that case either $q$ or $h$ or $T$ is random, usually not all three at once, but the number of independent random variables is liable to be relatively large, and even if each one takes on only a moderate number of possible values, the total number $L$ of possible vectors $\xi^\ell$ could be truly huge. For example, a problem with 10 independent random variables, each taking on 10 possible values, leads us to consider 10 billion ($= L = 10^{10}$) 10-dimensional vectors $\xi^\ell$: certainly not the type of data we want to, or can, keep in fast access memory.
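The combinatorial growth of L can be made concrete with a small sketch. It assumes independent discrete random variables described by their supports and marginal probabilities; the function names are hypothetical.

```python
from itertools import product
from math import prod

def scenario_count(supports):
    """Number L of joint realizations of independent discrete variables."""
    return prod(len(values) for values in supports)

def enumerate_scenarios(supports, probabilities):
    """Yield (vector, probability) pairs; only sensible when L is small."""
    pairs = [list(zip(v, p)) for v, p in zip(supports, probabilities)]
    for combo in product(*pairs):
        vector = tuple(value for value, _ in combo)
        weight = prod(p for _, p in combo)   # independence: multiply marginals
        yield vector, weight

# Ten variables with ten values each: far too many vectors to store.
huge = scenario_count([range(10)] * 10)

# Three binary variables: small enough to enumerate explicitly.
small = list(enumerate_scenarios([[0, 1]] * 3, [[0.5, 0.5]] * 3))
```

Enumeration is only practical for the small case; for the large one, only the count is computed.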

Second, the original problem is again a stochastic optimization problem of the recourse type but (3.1) is the result of an approximation scheme, either a discretization of an absolutely continuous probability measure or a coarser discretization of a problem whose “finite” number of possible realizations is too large to contemplate; for more about approximation schemes consult Chapter 2. In this case $L$, the number of possible values taken on by $\xi(\cdot)$, could be relatively small, say a few hundred, in particular if (3.1) is part of a sequential approximation scheme; details can be found in Chapter 2, see also Birge and Wets [2], for example.

Third, the original problem is a stochastic optimization problem but we have only very limited statistical information about the distribution of the random elements, and $\xi^1, \ldots, \xi^L$ represent all the statistical data available. Problem (3.1) will be solved using the empirical distribution, the idea being to submit its solution to statistical analysis such as suggested by the work of Dupačová and Wets [7]. In this case $L$ is usually quite small; we are thinking in terms of $L$ less than 20 or 30.

Fourth, problem (3.1) resulted from an attempt at modeling uncertainty, with no accompanying statistical basis that allows for accurate descriptions of the phenomena by stochastic variables. As indicated in Chapter 1, this mostly comes from situations when there is data uncertainty about some parameters (of a deterministic problem) or we want to analyse decision making or policy setting and the future is modeled in terms of scenarios (projections with tolerances for errors). In this case the number $L$ of possible variants of a key scenario that we want to consider is liable to be quite small, say 5 to 20, and the $\xi^\ell$ can often be expressed as a sum:

$$\xi^\ell = \zeta^0 + \eta_{1\ell}\,\zeta^1 + \cdots + \eta_{K\ell}\,\zeta^K$$

where for $k = 1, \ldots, K$, the $\zeta^k \in R^N$ are fixed vectors and $(\eta_1(\cdot), \ldots, \eta_K(\cdot))$ are scalar random variables with possible values $\eta_{1\ell}, \ldots, \eta_{K\ell}$ for $\ell = 1, \ldots, L$. We think of $K$ as being 2 or 3. The typical case is when we have a base projection $\zeta^0 + \zeta^1$, but we want to consider the possibility that certain factors may vary by as much as 25% (plus or minus). In such a case the model assigns to the (only) random variable $\eta(\cdot)$ some discrete distribution on the interval $[.75, 1.25]$.

With this as background to our study, it is natural to search for solution procedures for recourse problems with discrete distributions when there is either only a moderate number of vectors $\xi^\ell$ to consider (scenarios, limited statistical information, approximation) or there is a relatively large number of possible vectors $\xi^\ell$ that result from combinations of the values taken on by independent random variables. The techniques discussed further on apply to both classes of problems, but the tendency is to think of software development that would be appropriate for problems with relatively small $L$, say from 5 to 1,000. This is not just because this class of problems appears more manageable, but also because when $L$ is actually very large, although finite, the overall solution strategy would still rely on the solution of approximate problems with relatively small $L$.


3.1 Recourse Models as Large Scale Linear Programs 


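Substituting in (3.1) the expressions for $\mathcal{Q}$ and $Q$, we see that we can obtain the solution by solving the linear program:

$$\begin{array}{ll} \text{find} & x \in R^{n_1}_+ \text{ and for } \ell = 1, \ldots, L,\ y^\ell \in R^{n_2}_+ \\ \text{such that} & Ax = b, \\ & T^\ell x + W y^\ell = h^\ell, \quad \ell = 1, \ldots, L \\ \text{and} & z = cx + \displaystyle\sum_{\ell=1}^{L} p_\ell\, q^\ell y^\ell \text{ is minimized.} \end{array} \tag{3.4}$$

To each recourse decision to be chosen if $\xi(\cdot)$ takes on the value $\xi^\ell = (q^\ell, h^\ell, T^\ell)$ corresponds the vector of variables $y^\ell$. This is a linear program with

$m_1 + m_2 \cdot L$ constraints,

and

$n_1 + n_2 \cdot L$ variables.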

The possibility of solving this problem using standard linear programming software depends very much on $L$, but even if it were possible to do so, in order to avoid making the solving of (3.4) prohibitively expensive, in terms of time and required computer memory, it is necessary to exploit the properties of this highly structured large scale linear program. The structure of the tableau of detached coefficients takes on the form:

$$\begin{array}{ccccc|l} c & p_1 q^1 & p_2 q^2 & \cdots & p_L q^L & \\ \hline A & & & & & = b \\ T^1 & W & & & & = h^1 \\ T^2 & & W & & & = h^2 \\ \vdots & & & \ddots & & \ \ \vdots \\ T^L & & & & W & = h^L \end{array}$$

Figure 3.1 Structure of discrete stochastic program


We have here a so-called dual block angular structure with the important additional feature that all the matrices, except for $A$, along the block diagonal are the same. It is this feature that will lead us to the algorithms that are analysed in Section 3.3 and which up to now have provided us with the best computational results. It is also this feature which led Dantzig and Madansky [5] to suggest a solution procedure for (3.4) by way of the dual. Indeed, the following problem is a dual of (3.4):

$$\begin{array}{ll} \text{find} & \sigma \in R^{m_1}, \text{ and for } \ell = 1, \ldots, L,\ \pi^\ell \in R^{m_2} \\ \text{such that} & \sigma A + \displaystyle\sum_{\ell=1}^{L} p_\ell\, \pi^\ell T^\ell \le c, \\ & \pi^\ell W \le q^\ell, \quad \ell = 1, \ldots, L \\ \text{and} & w = \sigma b + \displaystyle\sum_{\ell=1}^{L} p_\ell\, \pi^\ell h^\ell \text{ is maximized.} \end{array} \tag{3.5}$$

Problem (3.5) is not quite the usual (formal) dual of (3.4). To obtain the classical linear program dual, set

$$\hat\pi^\ell = p_\ell\, \pi^\ell$$

and substitute in (3.5). This problem has block angular structure, the block diagonal consisting again of identical matrices $W$. The tableau with detached coefficients takes on the form:

$$\begin{array}{ccccc|l} b' & p_1 (h^1)' & p_2 (h^2)' & \cdots & p_L (h^L)' & \\ \hline A' & p_1 (T^1)' & p_2 (T^2)' & \cdots & p_L (T^L)' & \le c' \\ & W' & & & & \le (q^1)' \\ & & W' & & & \le (q^2)' \\ & & & \ddots & & \ \ \vdots \\ & & & & W' & \le (q^L)' \end{array}$$

Figure 3.2 Structure of dual problem.


Transposition is denoted by $'$, e.g. $W'$ is the transposed matrix of $W$. Observe that we have now fewer (unconstrained) variables but a larger number of constraints, assuming that $n_2 > m_2$, as is usual when the recourse problem (3.3) is given its canonical linear programming formulation. In Section 3.2 we review briefly the methods that rely on the structure of this dual problem for solving recourse models.

At least when the technology matrix $T$ is nonstochastic, i.e. when $T^\ell = T$, a substitution of variables, mentioned in Wets [26], leads to a linear programming structure that has received a lot of attention in the literature devoted to large scale dynamical systems. Using the constraints of (3.4), it follows that for all $\ell = 1, \ldots, L-1$,

$$Tx = h^\ell - W y^\ell$$

and substituting in the $(\ell+1)$-th system, we obtain

$$-W y^\ell + W y^{\ell+1} = h^{\ell+1} - h^\ell.$$

Problem (3.4) is thus equivalent to

$$\begin{array}{ll} \text{find} & x \in R^{n_1}_+ \text{ and for } \ell = 1, \ldots, L,\ y^\ell \in R^{n_2}_+ \\ \text{such that} & Ax = b \\ & Tx + W y^1 = h^1 \\ & -W y^{\ell-1} + W y^\ell = h^\ell - h^{\ell-1}, \quad \ell = 2, \ldots, L \\ \text{and} & z = cx + \displaystyle\sum_{\ell=1}^{L} p_\ell\, q^\ell y^\ell \text{ is minimized.} \end{array} \tag{3.6}$$

With $h^0 = 0$ and for $\ell = 1, \ldots, L$,

$$\hat h^\ell = h^\ell - h^{\ell-1},$$

the tableau of detached coefficients exhibits a staircase structure:

$$\begin{array}{ccccc|l} c & p_1 q^1 & p_2 q^2 & \cdots & p_L q^L & \\ \hline A & & & & & = b \\ T & W & & & & = \hat h^1 \\ & -W & W & & & = \hat h^2 \\ & & \ddots & \ddots & & \ \ \vdots \\ & & & -W & W & = \hat h^L \end{array}$$

Figure 3.3 Equivalent staircase structure.


We bring this to the fore in order to stress at the same time the close relationship and the basic difference between the problem at hand and those encountered in the context of dynamical systems, i.e. discrete versions of continuous linear programs or linear control problems. Superficially, the problems are structurally similar, and indeed the matrix of a linear dynamical system may very well have precisely the structure of the matrix that appears in Figure 3.3. Hence, one may conclude that the results and the computational work for staircase dynamical systems, cf. in particular Perold and Dantzig [16], Fourer [8], and Saunders [19], are in some way transferable to the stochastic programming case. Clearly some of the ideas and artifices that have proved their usefulness in the setting of linear (discrete time) dynamical systems should be explored, adapted and tried in the stochastic programming context. But one should at all times remain aware of the fact that dynamical systems have coefficients (data) that are 1-parameter dependent (time) whereas we can view the coefficients of stochastic problems as being multi-parameter dependent. In some sense, the gap between Figure 3.3 and staircase structured linear programs that arise from dynamical systems is the same as that between ordinary differential equations and partial differential equations. We are not dealing here with a phenomenon that goes forward (in time) but one which can spread all over $R^N$ (which is only partially ordered)! Thus, it is not so surprising that from a computational viewpoint almost no effort has been made to exploit the structure of Figure 3.3 to solve stochastic programs with recourse. However, the potential is there and should not remain unexplored.


3.2 Methods that Exploit the Dual Structure
Dantzig and Madansky [5], pointed out that the dual problem (3.5) with matrix 
structure Figure 3.2 is ripe for the application of the decomposition principle. 
It was also the properties of Figure 3.2 that led Strazicky [21], to suggest and 
implement a basis factorization scheme, further analysed and modified by Kall 
[11], Wets [29], and Birge in Chapter 12. We give a brief description of both 
methods and study the connections between these two procedures. We begin 
with the second one, giving a modified compact version of the original proposal. 
We assume that $W$ is of full row rank; if not, the recourse problem (3.3) defining $Q$ would be infeasible for some of the values of $h^\ell$ and $T^\ell$, unless all belong to the appropriate subspace of $R^{m_2}$, in which case a row transformation would allow us to delete the redundant constraints. We also assume that $A$ is of full row rank (possibly 0 when there are no constraints of that type). Thus with the columns of $A'$ and $W'$ linearly independent (recall that the variables $\sigma$ and $\pi$ are unrestricted), and after introducing the slack variables ($s^0 \in R^{n_1}_+$ and $s^\ell \in R^{n_2}_+$ for $\ell = 1, \ldots, L$), we see that each basic feasible solution will include at least $n_2$ variables of each subsystem

$$\pi^\ell W + s^\ell I = q^\ell, \quad s^\ell \ge 0, \quad \ell = 1, \ldots, L, \tag{3.7}$$

the (unrestricted) $m_2$ variables $\pi^\ell$ and a choice of at least $(n_2 - m_2)$ slack variables $(s^\ell_j,\ j = 1, \ldots, n_2)$. Thus the portion of the basic columns that appear in the $\ell$-th subsystem can be subdivided into two parts

$$[B_\ell', I_{\ell 2}] = [(W', I_{\ell 1}), I_{\ell 2}]$$

where $(W', I_{\ell 1})$ is an $(n_2 \times n_2)$ invertible matrix and the extra columns, if any, are relegated to $I_{\ell 2}$. Thus, schematically and up to a rearrangement of columns, a feasible basis has the structure shown in detached coefficient form in Figure 3.4.

[Figure 3.4 Basis structure of dual: blocks $D'$ and $C_1, \ldots, C_L$ in the first $n_1$ rows; blocks $B_\ell'$ and $I_{\ell 2}$ in the rows of the $\ell$-th subsystem, $\ell = 1, \ldots, L$.]

The matrix $D'$ corresponds to the columns of $(A', I_{n_1})$ that belong to this basis and, for $\ell = 1, \ldots, L$, $C_\ell$ is the $n_1 \times n_2$ matrix:

$$C_\ell = [\,p_\ell (T^\ell)',\ 0\,]$$

(recall that $(T^\ell)'$ is of dimension $n_1 \times m_2$). Each $B_\ell'$, after possible rearrangement of rows and columns, is of the following type:

$$B_\ell' = \begin{pmatrix} W'_{(\ell)} & 0 \\ W'_{[\ell]} & I \end{pmatrix} = (W', I_{\ell 1})$$

Figure 3.5 Structure of $B_\ell'$.


where $W'_{(\ell)}$ is an $m_2 \times m_2$ invertible submatrix of $W'$, and $W'_{[\ell]}$ are the remaining rows of $W'$ that correspond to the rows of the identity that have been included in $B_\ell'$ (through $I_{\ell 1}$). The simplex multipliers associated with this basis, of dimension $n_1 + n_2 \cdot L$, are denoted by

$$(x, y) = (x, y^1, \ldots, y^L)$$

and are given by the relations

$$\begin{pmatrix} D & N \\ C & B \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \beta \\ \gamma \end{pmatrix} \tag{3.8}$$

where $[\beta', \gamma']$ is the appropriate rearrangement of the subvector of coefficients of the objective of Figure 3.2 that corresponds to the columns of the basis, with $\beta$ being the subvector of $[b', 0]$ whose components correspond to the columns of $D'$. This (dual feasible) basis is optimal if the vectors

$$(x, y^\ell,\ \ell = 1, \ldots, L)$$

defined through (3.8) are primal feasible, i.e. satisfy the constraints of (3.4). To obtain $x$ and $y$ we see that (3.8) yields

$$y = B^{-1}(\gamma - Cx),$$

$$x = (D - N B^{-1} C)^{-1} (\beta - N B^{-1} \gamma).$$


For every $\ell = 1, \ldots, L$,

$$y^\ell = B_\ell^{-1} (\gamma^\ell - C_\ell x) \tag{3.9}$$

where $\gamma^\ell$ is the subvector of $[p_\ell h^\ell, 0]$ that corresponds to the columns in $B_\ell'$. We have used the fact that $B$ is block diagonal with invertible matrices $(B_\ell,\ \ell = 1, \ldots, L)$ on the diagonal. Going one step further and using the properties of $N$ and $C$, we get the system for $x$:

$$\Bigl(D - \sum_{\ell=1}^{L} N_\ell B_\ell^{-1} C_\ell\Bigr)\, x = \beta - \sum_{\ell=1}^{L} N_\ell B_\ell^{-1} \gamma^\ell. \tag{3.10}$$

The system (3.10) involves $n_1$ equations in $n_1$ variables and the $L$ systems (3.9) are of order $n_2$. Thus instead of calculating the inverse of the full basis, a square matrix of order $(n_1 + n_2 \cdot L)$, all that is needed is the inverse of $L$ matrices of order $n_2$ and a square matrix of order $n_1$.
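The saving described here is ordinary block elimination. The sketch below solves a system with a bordered block-diagonal matrix by working only with the small diagonal blocks and one reduced system of order n1; exact rational arithmetic keeps the check free of rounding. All names and the numerical data are hypothetical, and no sparsity or updating scheme is attempted.

```python
from fractions import Fraction

def solve(A, R):
    """Solve A X = R (A square and nonsingular, R a matrix) by Gauss-Jordan."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(v) for v in R[i]] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def block_solve(D, beta, blocks):
    """Solve D x + sum_l N_l y_l = beta and C_l x + B_l y_l = gamma_l for all l.
    blocks is a list of tuples (N_l, C_l, B_l, gamma_l); vectors are columns."""
    S, rhs, reduced = D, beta, []
    for N, C, B, gamma in blocks:
        BinvC, Binvg = solve(B, C), solve(B, gamma)   # only small systems
        S = sub(S, matmul(N, BinvC))                  # reduced matrix of order n1
        rhs = sub(rhs, matmul(N, Binvg))
        reduced.append((BinvC, Binvg))
    x = solve(S, rhs)                                 # counterpart of (3.10)
    ys = [sub(Binvg, matmul(BinvC, x)) for BinvC, Binvg in reduced]  # cf. (3.9)
    return x, ys

D, beta = [[4, 1], [1, 3]], [[1], [2]]
blocks = [
    ([[1, 0], [0, 1]], [[1, 0], [0, 1]], [[2, 0], [0, 2]], [[1], [0]]),
    ([[0, 1], [1, 0]], [[1, 1], [0, 1]], [[1, 1], [0, 1]], [[2], [1]]),
]
x, ys = block_solve(D, beta, blocks)
```

Substituting x and the y_l back into both sets of equations reproduces the right-hand sides exactly, which is a convenient way to check such a routine.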

Similarly, to calculate the values to assign to the basic variables associated with this basis, the same inverses are all that is really required, as can easily be verified. In order to implement this method one would need to work out the updating procedures to show that the simplex method can be performed in this compact form, i.e. that the updating procedures involve only the restricted inverses. But there are other features of which one should take advantage before one proceeds with implementation.


Recall that

$$B_\ell = \begin{pmatrix} W_{(\ell)} & W_{[\ell]} \\ 0 & I \end{pmatrix} \tag{3.11}$$

where $W_{(\ell)}$ is an invertible matrix of size $m_2 \times m_2$. Then

$$B_\ell^{-1} = \begin{pmatrix} W_{(\ell)}^{-1} & -W_{(\ell)}^{-1} W_{[\ell]} \\ 0 & I \end{pmatrix}. \tag{3.12}$$

Thus it really suffices to know the inverse of $W_{(\ell)}$, and rather than keeping and updating the $n_2 \times n_2$ matrix $B_\ell^{-1}$, all the information that is really needed can be handled by updating an $m_2 \times m_2$ matrix, relying on sparse updates whenever possible. This should result in substantial savings. The algorithm could be made even more efficient by taking advantage of the repetition of similar (sub)bases $W_{(\ell)}$. We shall not pursue this any further at this time because all of these computational shortcuts are best handled in the framework of methods based on the decomposition principle that we describe next.

The decomposition principle, as used to solve the linear program (3.5), generates the master problem from the equations

$$\sigma A + \sum_{\ell=1}^{L} \pi^\ell (p_\ell T^\ell) \le c,$$

by generating extreme points or directions of recession (directions of unboundedness) from the polyhedral regions determined by the $L$ subproblems,

$$\pi^\ell W \le q^\ell.$$

In order to simplify the comparison with the factorization method described earlier, let us assume that

$$\{\pi \mid \pi W \le 0\} = \{0\},$$

i.e. there are no directions of recession other than 0, which means that for all $\ell$, the polyhedra $\{\pi \mid \pi W \le q^\ell\}$ are bounded; feasibility of (3.5) implies that they are nonempty. For $k = 1, \ldots, \nu$, let

$$\pi^k = (\pi^{1k}, \ldots, \pi^{Lk})$$

be the extreme point generated by the $k$-th iteration of the decomposition method, i.e.

$$\pi^{\ell k} \in \text{argmin}\,\bigl(p_\ell\, \pi^\ell (h^\ell - T^\ell x^k) \mid \pi^\ell W \le q^\ell\bigr) \tag{3.13}$$

where $x^k = (x^k_j,\ j = 1, \ldots, n_1)$ are the multipliers associated with the first $n_1$ linear inequalities of the master problem:

$$\begin{array}{ll} \text{find} & \sigma \in R^{m_1},\ \lambda_k \in R_+,\ k = 1, \ldots, \nu \\ \text{such that} & \sigma A + \displaystyle\sum_{k=1}^{\nu} \lambda_k \Bigl(\sum_{\ell=1}^{L} p_\ell\, \pi^{\ell k} T^\ell\Bigr) \le c \\ & \displaystyle\sum_{k=1}^{\nu} \lambda_k = 1 \\ \text{and} & w = \sigma b + \displaystyle\sum_{k=1}^{\nu} \lambda_k \Bigl(\sum_{\ell=1}^{L} p_\ell\, \pi^{\ell k} h^\ell\Bigr) \text{ is maximized.} \end{array} \tag{3.14}$$

The basis associated with the master problem is $(n_1 \times n_1)$, whereas the basis for each subproblem is exactly of order $n_2$. In the process of solving the subproblems the iterations of the simplex method bring us from one basis of type (3.11) to another one of this type (all transposed, naturally) with inverses given by (3.12). Here again, the implementation should take advantage of this structural property, and updates should be in terms of the $m_2 \times m_2$ submatrices $W_{(\ell)}$. But we should also take advantage of the fact that all these subproblems are identical except for the right-hand sides and/or the cost coefficients, and this, in turn, would lead us to the use of the bunching and sifting procedures of Section 3.4.

It is remarkable and important to observe that the basis factorization method with the modifications alluded to earlier and the decomposition method applied to the dual, as proposed by Dantzig and Madansky [5], require the same computational effort; J. Birge gives a detailed analysis in Chapter 12, and B. Strazicky independently arrived at similar results. In view of all of this it is appropriate to view the method relying on basis factorization as a very close relative of the decomposition method as applied to the dual problem (3.5), but it does not give us the organizational flexibility provided by this latter algorithm. On conceptual grounds, as well as in terms of computational efficiency, it is the decomposition based algorithm that should be retained for potential software implementation. In fact, this is essentially what has occurred, but it is a “primal” version of this decomposition algorithm which, in this class of (essentially) equivalent methods, appears best suited for solving linear stochastic programs with recourse. It is a primal method, which means that we always have a feasible $x \in R^{n_1}_+$ at our disposal, and it allows us to take advantage in the most straightforward manner of some of the properties of recourse models to speed up computations.


3.3 Methods that are Primal Oriented


The great difference between the methods that we consider next and those of Section 3.2 is that finding an $x$ that solves the stochastic program (3.1) is now viewed as our major, if not exclusive, concern. Obtaining the corresponding recourse decisions $(y^\ell,\ \ell = 1, \ldots, L)$ or associated dual multipliers $(\pi^\ell,\ \ell = 1, \ldots, L)$ is of no real interest, and we only perform some of these calculations because the search for an optimal solution $x$ requires knowing some of these quantities, at least in an amalgamated form. On the other hand, in the methods of Section 3.2 all the variables $(\sigma, \pi^1, \ldots, \pi^L)$ are treated as equals; to have the optimality criterion fail for some variable in subsystem $\ell$ (even when $p_\ell$ is relatively small) is handled with the same concern as having the optimality criterion fail for some of the $(\sigma_i,\ i = 1, \ldots, m_1)$ variables.

Another important property of these methods is their natural extension 
to stochastic programs with arbitrary distribution functions. In fact, they are 
particularly well-suited for use in a sequential scheme for solving stochastic pro- 
grams by successive refinement of the discretization of the probability measure, 
each step involving the solution of a problem of type (3.1), cf. Chapter 2. 

We stress these conceptual differences, because they may lead to different, 
more flexible, solution strategies; although we are very much aware of the fact 
that if at each stage of the algorithm all operations are carried out (to optimal- 
ity), it is possible to find their exact counterpart in the algorithms described 
in Section 3.2; for the relationship between the L-shaped algorithm described 
here and the decomposition method applied to the dual, see Van Slyke and 
Wets [20]; between the above and the basis factorization method see Chap- 
ter 13; consult also Ho [10], for the relationship between various schemes for 
piecewise linear functions which are widely utilized for solving certain classes 
of stochastic programming problems, and Chapter 4. 

The L-shaped algorithm, which takes its name from the matrix layout of the problem to be solved, was proposed by Van Slyke and Wets [20]; in Chapter 12, Birge describes his implementation of this method. It can be viewed as a cutting hyperplane algorithm (outer linearization), but to stay in the framework of our earlier development, it is best to interpret it here as a partial decomposition method. We begin with a description of a very crude version of the algorithm; only later do we elaborate on the modifications that are vital to make the method really efficient. To describe the method it is useful to consider the problem in its original form (3.1), which we repeat here for easy reference:

$$\begin{array}{ll} \text{find} & x \in R^{n_1}_+ \\ \text{such that} & Ax = b, \\ \text{and} & z = cx + \mathcal{Q}(x) \text{ is minimized.} \end{array} \tag{3.15}$$


We assume that the problem is feasible and bounded; implementation of the algorithm would require an appropriate coding of the initialization step relying on the criteria for feasibility and boundedness such as found in Wets [27]. The method consists of three steps that can be interpreted as follows. In Step 1, we solve an approximation of (3.15) obtained by replacing $\mathcal{Q}$ by an outer-linearization; this brings us to the solving of a linear program whose constraints are $Ax = b$, $x \ge 0$ and the additional constraints (3.16) and (3.17) that come from:

(i) induced feasibility cuts generated by the fact that the choice of $x$ must be restricted to those for which $\mathcal{Q}(x)$ is finite, or equivalently for which $Q(x, \xi^\ell) < +\infty$ for all $\ell = 1, \ldots, L$, or still for which, for all $\ell = 1, \ldots, L$, there exists $y \in R^{n_2}_+$ such that $Wy = h^\ell - T^\ell x$.

(ii) linear approximations to $\mathcal{Q}$ on its domain of finiteness.

These constraints are generated systematically through Steps 2 and 3 of the algorithm, when a proposed solution $x^\nu$ of the linear program in Step 1 fails to satisfy the induced constraints, i.e. $\mathcal{Q}(x^\nu) = \infty$ (Step 2), or if the approximating problem does not yet match the function $\mathcal{Q}$ at $x^\nu$ (Step 3). The row-vector generated in Step 3 is actually a subgradient of $\mathcal{Q}$ at $x^\nu$. The convergence of the algorithm, under the appropriate nondegeneracy assumptions, to an optimal solution of (3.15) is based on the fact that there are only a finite number of constraints of type (3.16) and (3.17) that can be generated by Steps 2 and 3, since each one corresponds to some basis of $W$ and a pair $(h^\ell, T^\ell)$, or to a basis of $W$ and to one of a finite number of weighted averages of the $(q^\ell,\ \ell = 1, \ldots, L)$ and $((h^\ell, T^\ell),\ \ell = 1, \ldots, L)$.

Step 0. Set $\nu = r = s = 0$.

Step 1. Set $\nu = \nu + 1$ and solve the linear program

$$\begin{array}{lll} \text{find} & x \in R^{n_1}_+,\ \theta \in R & \\ \text{such that} & Ax = b & \\ & D_k x \ge d_k, \quad k = 1, \ldots, r, & (3.16) \\ & E_k x + \theta \ge e_k, \quad k = 1, \ldots, s, & (3.17) \\ \text{and} & cx + \theta = z \text{ is minimized.} & \end{array}$$

Let $(x^\nu, \theta^\nu)$ be an optimal solution. If there are no constraints of type (3.17), the variable $\theta$ is ignored in the computation of the optimal $x^\nu$; the value of $\theta^\nu$ is then fixed at $-\infty$.

Step 2. For $\ell = 1, \ldots, L$ solve the linear programs

$$\begin{array}{ll} \text{find} & y \in R^{n_2}_+,\ v^+ \in R^{m_2}_+,\ v^- \in R^{m_2}_+ \\ \text{such that} & Wy + I v^+ - I v^- = h^\ell - T^\ell x^\nu \\ \text{and} & e v^+ + e v^- = v^\ell \text{ is minimized} \end{array} \tag{3.18}$$

(here $e$ denotes the row vector $(1, 1, \ldots, 1)$), until for some $\ell$ the optimal value $v^\ell > 0$. Let $\sigma^\nu$ be the associated simplex multipliers and define

$$D_{r+1} = \sigma^\nu T^\ell$$

and

$$d_{r+1} = \sigma^\nu h^\ell$$

to generate an induced feasibility cut. Return to Step 1 adding this new constraint of type (3.16) and set $r = r + 1$. If for all $\ell$ the optimal value of the linear program (3.18) is $v^\ell = 0$, go to Step 3.

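Step 3. For every $\ell = 1, \ldots, L$, solve the linear program

$$\begin{array}{ll} \text{find} & y \in R^{n_2}_+ \\ \text{such that} & Wy = h^\ell - T^\ell x^\nu, \\ \text{and} & q^\ell y = w^\ell \text{ is minimized.} \end{array} \tag{3.19}$$

Let $\pi^{\ell\nu}$ be the multipliers associated with the optimal solution of problem $\ell$. Set $s = s + 1$ and define

$$E_s = \sum_{\ell=1}^{L} p_\ell\, \pi^{\ell\nu} T^\ell,$$

$$e_s = \sum_{\ell=1}^{L} p_\ell\, \pi^{\ell\nu} h^\ell,$$

and

$$w^\nu = \sum_{\ell=1}^{L} p_\ell\, \pi^{\ell\nu} (h^\ell - T^\ell x^\nu) = e_s - E_s x^\nu.$$

If $\theta^\nu \ge w^\nu$, we stop; $x^\nu$ is the optimal solution. Otherwise, we return to Step 1 with a new constraint of type (3.17).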
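The interplay of Steps 1 and 3 can be traced on a deliberately tiny instance. The sketch below rests on strong simplifying assumptions that are not part of the algorithm as stated: the decision x is a scalar restricted to an interval [lo, hi] (taking the place of Ax = b, x ≥ 0), the recourse is simple with W = (1, -1), so that Step 2 is never needed, and only h is random. Under these assumptions the recourse multipliers are available in closed form and the master problem can be solved by inspecting breakpoints. All names and data are hypothetical.

```python
from fractions import Fraction as F

def recourse_multiplier(residual, q_plus, q_minus):
    """Optimal dual multiplier of min{q+ y+ + q- y- | y+ - y- = residual, y >= 0}."""
    return q_plus if residual > 0 else -q_minus

def solve_master(c, lo, hi, cuts):
    """Minimize c*x + theta subject to lo <= x <= hi and E*x + theta >= e per cut."""
    if not cuts:                       # theta is ignored (fixed at -infinity)
        return (lo if c >= 0 else hi), None
    candidates = {lo, hi}
    for i, (E1, e1) in enumerate(cuts):
        for E2, e2 in cuts[i + 1:]:
            if E1 != E2:
                x = (e1 - e2) / (E1 - E2)      # breakpoint where two cuts cross
                if lo <= x <= hi:
                    candidates.add(x)
    def theta(x):
        return max(e - E * x for E, e in cuts)
    x_best = min(sorted(candidates), key=lambda x: c * x + theta(x))
    return x_best, theta(x_best)

def l_shaped(c, lo, hi, t, q_plus, q_minus, scenarios, max_iter=50):
    cuts = []                                   # pairs (E_k, e_k) of type (3.17)
    for _ in range(max_iter):
        x, theta = solve_master(c, lo, hi, cuts)                    # Step 1
        duals = [(p, h, recourse_multiplier(h - t * x, q_plus, q_minus))
                 for p, h in scenarios]                             # Step 3
        E = sum(p * pi * t for p, h, pi in duals)
        e = sum(p * pi * h for p, h, pi in duals)
        w = e - E * x
        if theta is not None and theta >= w:
            return x, c * x + theta, len(cuts)
        cuts.append((E, e))
    raise RuntimeError("no convergence within max_iter")

scenarios = [(F(1, 3), F(2)), (F(1, 3), F(4)), (F(1, 3), F(6))]
x_opt, z_opt, n_cuts = l_shaped(F(1), F(0), F(10), F(1), F(4), F(0), scenarios)
```

On this instance the master problem visits x = 0, 10, 4 and finally 6, where the outer linearization matches the recourse function; three cuts of type (3.17) are generated along the way.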

An efficient implementation of this algorithm, whose steps can be identified with those of the decomposition method applied to the dual problem (see Section 3.2), depends very much on the acceleration of Steps 2 and 3. This is made possible by relying on the specific properties of the problem at hand (3.15), and it is in order to exploit these properties that we have separated Steps 2 and 3, which are the counterparts of Phase I and Phase II of the simplex method as applied to the recourse problem (3.3). In practice one certainly does not start from scratch when solving the $L$ linear programs in Step 3; Section 3.4 is devoted to the analysis of Step 3, i.e. how to take advantage of the fact that the $L$ linear programs that need to be solved have the same technology matrix $W$, as well as of the fact that the $\xi^\ell = (q^\ell, h^\ell, T^\ell)$ are the realizations of a random vector. Here we concern ourselves with the improvements that could be made to speed up Step 2, and we see that in many instances dramatic gains could be realized.

First of all, Step 2 can be skipped altogether if the stochastic program is with complete recourse, i.e. when

$$\text{pos}\, W := \{t \mid t = Wy,\ y \ge 0\} = R^{m_2}, \tag{3.20}$$

a quite common occurrence in practice. This means naturally that no induced feasibility constraints (3.16) need to be generated. This will also be the case if we have a problem with relatively complete recourse, i.e. when for every $x$ satisfying $Ax = b$, $x \ge 0$, and for every $\ell = 1, \ldots, L$, the linear system

$$Wy = h^\ell - T^\ell x, \quad y \ge 0,$$

is feasible. This weaker condition is much more difficult to recognize, and to verify it would precisely require the procedure given in Step 2.

Even in the general case, it may be possible to substitute for Step 2, for some $(h^\nu, T^\nu)$:

Step 2'. Solve the linear program

$$\begin{array}{ll} \text{find} & y \in R^{n_2}_+,\ v^+ \in R^{m_2}_+,\ v^- \in R^{m_2}_+ \\ \text{such that} & Wy + I v^+ - I v^- = h^\nu - T^\nu x^\nu \\ \text{and} & e v^+ + e v^- = v^\nu \text{ is minimized.} \end{array} \tag{3.21}$$

Let $\sigma^\nu$ be the associated simplex multipliers and, if the optimal value $v^\nu > 0$, define

$$D_{r+1} = \sigma^\nu T^\nu,$$

and

$$d_{r+1} = \sigma^\nu h^\nu$$

to generate an induced feasibility cut of type (3.16). Return to Step 1 with $r = r + 1$. If the optimal value $v^\nu = 0$, go to Step 3.

This means that we have replaced solving L linear programs by just solving 
1 of them. In some other cases it may be necessary to solve a few problems 
of type (3.21) but the effort would in no way be commensurate with that of 
solving all L linear programs of Step 2. In Section 3.5 of Wets [28], one can find 
a detailed analysis of the cases when such a substitution is possible, as well as 
some procedures for the choice or construction of the quantities h” and T” that 
appear in the formulation of (3.21). Here we simply suggest the reasons why 
this simplification is possible and pay particular attention to the case when the 
matrix T is nonstochastic.

Let $\preceq$ be the partial ordering induced by the closed convex polyhedral cone
pos W, see (3.20), i.e. $a^1 \preceq a^2$ if $a^2 - a^1 \in \operatorname{pos} W$. Then for given $x^\nu \in R^{n_1}$ and
for every $\ell = 1,\dots,L$, the linear system

$$Wy = h^\ell - T^\ell x^\nu, \quad y \ge 0 \tag{3.22}$$

is feasible if there exists $a^\nu \in R^{m_2}$ such that for all $\ell = 1,\dots,L$,

$$a^\nu \preceq h^\ell - T^\ell x^\nu, \tag{3.23}$$

and the linear system

$$Wy = a^\nu, \quad y \ge 0 \tag{3.24}$$

is feasible, or equivalently $a^\nu \in \operatorname{pos} W$. There always exists an $a^\nu$ that satisfies
(3.23); recall that $L$ is finite. If in addition $a^\nu$ can be chosen so that

$$a^\nu = h^\nu - T^\nu x^\nu \tag{3.25}$$

for some $\nu \in \{1,\dots,L\}$, then (3.22) is feasible for all $\ell$ if and only if (3.24) is feasible
with $a^\nu$ as defined by (3.25). Although in general such an $a^\nu$ does not exist, in
practice at most a few extreme points of the set

$$S^\nu = \{a \mid a = h^\ell - T^\ell x^\nu,\ \ell = 1,\dots,L\}$$

need to be considered in order to verify the feasibility of all the linear systems
(3.22). Computing lower bounds of $S^\nu$ with respect to $\preceq$ may require more
work than we bargained for, but it really suffices, cf. Theorem 4.17 of Wets
[28], to construct lower bounds of $S^\nu$ with respect to any closed cone contained
in pos W, and this could be, and usually is taken to be, an orthant. In such a
case obtaining $a^\nu$ is effortless.

Let us consider the case when $T$ is nonstochastic and assume that pos W
contains the positive orthant; if it contains another orthant, simply multiply
some rows by $-1$, making the corresponding adjustments in the vectors
$(h^\ell, \ell = 1,\dots,L)$. This certainly would be the case if slack variables are part of the
y-vector, for example.

For $i = 1,\dots,m_2$, let

$$a_i = \min_{\ell = 1,\dots,L} h_i^\ell.$$

If $a = h^\nu$ for some $\nu \in \{1,\dots,L\}$, which would always be the case if the
$(h_i(\cdot), i = 1,\dots,m_2)$ are independent random variables, then it follows from
the above that for $\ell = 1,\dots,L$, the linear system

$$Wy = h^\ell - Tx^\nu, \quad y \ge 0$$

is feasible if and only if

$$Wy = a - Tx^\nu, \quad y \ge 0$$

is feasible. Note that in this case the lower bound

$$a^\nu = a - Tx^\nu$$

is a simple function of $x^\nu$.
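
The construction of the componentwise lower bound can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `lower_bound_rhs` and the data are hypothetical, the realizations of $h$ are given as a plain list, and $T$ is taken to be nonstochastic as in the paragraph above.

```python
def lower_bound_rhs(h_list, T, x):
    """Return (a, a_nu, attained) for a nonstochastic T.

    a[i]     = min over the realizations of h[i]  (componentwise lower bound)
    a_nu     = a - T x                            (single right-hand side to test)
    attained = True if a coincides with one of the realizations h^l
    """
    m = len(h_list[0])
    a = [min(h[i] for h in h_list) for i in range(m)]
    Tx = [sum(T[i][j] * x[j] for j in range(len(x))) for i in range(m)]
    a_nu = [a[i] - Tx[i] for i in range(m)]
    attained = any(list(h) == a for h in h_list)
    return a, a_nu, attained


# Hypothetical data: three realizations of h, a 2 x 2 matrix T, a trial point x.
h_list = [(4, 7), (2, 5), (3, 9)]
T = [[1, 0], [1, 1]]
x = [1, 2]
a, a_nu, attained = lower_bound_rhs(h_list, T, x)
print(a, a_nu, attained)
```

When `attained` is true, testing the single system $Wy = a^\nu$, $y \ge 0$ settles feasibility for all realizations; when it is false, feasibility of that system is still sufficient (pos W is assumed to contain the positive orthant) but no longer necessary.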

In our description of the L-shaped algorithm the connections to large scale
linear programming may have been somewhat lost; if anything, it is how to deal
with the “nonlinearity” of $\mathcal{Q}$ which has played center stage. To regain
a more linear programming perspective it may be useful to view the algorithm
in the following light. Let us return to the dual block angular structure of Figure
3.1, from which it is obvious that if we can adjust the simplex method so that
it operates separately on the x-variables and the ($y^\ell$-variables, $\ell = 1,\dots,L$), it
will be possible to take advantage of the block diagonal structure of the problem
with respect to the ($y^\ell$-variables, $\ell = 1,\dots,L$). Given that some $x^\nu$ is known
which satisfies the constraints $x \ge 0$, $Ax = b$, finding the optimal solution
of Figure 3.1 with the additional constraint $x = x^\nu$ leads to solving a linear
program whose tableau of detached coefficients has the structure:

$$\begin{array}{cccc|c}
p_1 q^1 & p_2 q^2 & \cdots & p_L q^L & \\ \hline
W & & & & = h_\nu^1 \\
& W & & & = h_\nu^2 \\
& & \ddots & & \vdots \\
& & & W & = h_\nu^L
\end{array}$$

Figure 3.6 Structure of the y-problem.

where for $\ell = 1,\dots,L$, $h_\nu^\ell = h^\ell - T^\ell x^\nu$. Clearly, when confronted with such
a problem we want to take advantage of its separability properties and this is
precisely what is done in Steps 2 and 3 of the L-shaped algorithm.

The structure of Figure 3.6, with the same matrix $W$ on the block diagonal,
suggests that of a distributed system. A continuous version would take the form:

$$\begin{aligned}
&\text{find } y : \Omega \to R^{n_2} \\
&\text{such that } \forall\, \omega \in \Omega, \\
&\qquad y(\omega) \in \operatorname{argmin}\,[\, q(\omega) y \mid Wy = h_\nu(\omega),\ y \in R^{n_2}_+ \,].
\end{aligned} \tag{3.26}$$

Because of the linearity of the objective function, the trajectory $\omega \mapsto y(\omega)$
will be linear with respect to $h_\nu$ if the same basis of $W$ remains optimal. The
main task in solving (3.26) would be to decompose $\Omega$ into regions of linearity of
$y(\cdot)$. Once this decomposition is known the remainder is rather straightforward.
Finding this decomposition is essentially the subject of Section 3.4, which
concerns itself with the organization of the computational work so as to bring the
effort involved to an acceptable level. Problem (3.26) again brings to the fore
the connections between this work and that on dynamical systems (continuous
linear programming). With not too much difficulty it should be possible to
formulate a bang-bang principle for systems with distributed parameters space
(here $R^{m_2}$) that would correspond to our scheme for decomposing $\Omega$.

To conclude our discussion of the L-shaped algorithm, let us record a further
modification suggested by L. Nazareth. When the matrix $T$ is nonstochastic,
say $T^\ell = T$ for all $\ell$, then with $\chi = Tx$ and $\Psi(\chi) = \Psi(Tx) = \mathcal{Q}(x)$, the linear
program in Step 1 may be reformulated as

$$\begin{aligned}
&\text{find } x \in R^{n_1}_+,\ \chi \in R^{m_2},\ \theta \in R \\
&\text{such that } Ax = b \\
&\qquad\qquad Tx - \chi = 0 \\
&\qquad\qquad F_k \chi \ge f_k, \quad k = 1,\dots,r \\
&\qquad\qquad G_k \chi + \theta \ge g_k, \quad k = 1,\dots,s, \\
&\text{and } cx + \theta = z \text{ is minimized.}
\end{aligned} \tag{3.27}$$

The induced feasibility constraints are generated as earlier in Step 2 with

$$F_{r+1} = \sigma^\nu, \qquad f_{r+1} = \sigma^\nu h^\ell.$$

The optimality cuts (approximation cuts) are generated in Step 3 with

$$G_{s+1} = \sum_{\ell=1}^{L} p_\ell \pi^\ell, \qquad g_{s+1} = \sum_{\ell=1}^{L} p_\ell \pi^\ell h^\ell.$$

The linear program that generates the $\sigma^\nu$ and $\pi^\ell$ as (optimal) simplex
multipliers of Phases I and II respectively, is given by

$$\begin{aligned}
&\text{find } y \in R^{n_2}_+ \\
&\text{such that } Wy = h^\ell - \chi^\nu, \\
&\text{and } qy = w^\ell \text{ is minimized.}
\end{aligned}$$

Note that now the “nonlinearity” is handled in a space of dimension $m_2$ which
is liable to be much smaller than $n_1$, and we should reap all the advantages
that usually come from a reduction in the number of nonlinear variables.
All of these simplifications come from the fact that when $T$ is nonstochastic
we can interpret the search for an optimal solution as the search for an optimal
$\chi^*$, “the certainty equivalent”. It is easy to see that knowing $\chi^*$ would allow
us to solve the original problem by simply solving

$$\begin{aligned}
&\text{find } x \in R^{n_1}_+ \\
&\text{such that } Ax = b,\ Tx = \chi^*, \\
&\text{and } z = cx \text{ is minimized.}
\end{aligned} \tag{3.28}$$


The sequence $\{\chi^\nu, \nu = 1, \dots\}$ generated by the preceding algorithm can be
viewed as a sequence of tenders (to be “bet” against the uncertainty represented
by $h$). This then suggests other methods based on finding $\chi^*$ by considering
the best possible convex combination of the tenders generated so far; these
algorithms are based on generalized linear programming, see Nazareth and Wets
[15], and Chapter 4 of this Volume. In the context of the general class of linear
stochastic programming problems considered here, we have up to now very
limited experience with this method. The algorithm would proceed as follows:

Step 0. Find a feasible $x^0 \in R^{n_1}_+$ such that $Ax^0 = b$.
        Set $\chi^0 = Tx^0$.
        Choose $\chi^1, \dots, \chi^\nu$, potential tenders, $\nu \ge 0$.
Step 1. Find $(\sigma^\nu, \pi^\nu, \theta_\nu)$, the (optimal) simplex multipliers associated with
        the solution of the linear program:

$$\begin{aligned}
\text{minimize } \ & cx + \sum_{\ell=0}^{\nu} \lambda_\ell \Psi(\chi^\ell) \\
& Ax = b && : \sigma^\nu \\
& Tx - \sum_{\ell=0}^{\nu} \lambda_\ell \chi^\ell = 0 && : \pi^\nu \\
& \sum_{\ell=0}^{\nu} \lambda_\ell = 1 && : \theta_\nu \\
& x \ge 0,\ \lambda_\ell \ge 0.
\end{aligned}$$

Step 2. Stop unless there exists $\chi^{\nu+1}$ such that

$$\Psi(\chi^{\nu+1}) + \pi^\nu \chi^{\nu+1} < \theta_\nu, \tag{3.29}$$

        in which case return to Step 1 with $\nu = \nu + 1$.

The attractiveness of this approach rests on the fact that the algorithm allows
for the choice of a number of tenders (trial solutions) which would provide an
excellent initial approximate solution to the problem as a whole just after 1
passage through Step 1, assuming of course that the tenders $\chi^1, \dots, \chi^\nu$ are
chosen by an informed problem solver. Note, however, that for each tender
$\chi \in R^{m_2}$ we need to find the value of $\Psi(\chi) = \sum_{\ell=1}^{L} p_\ell \psi(\chi, \xi^\ell)$, i.e. solve the $L$
linear programs

$$\begin{aligned}
&\text{find } y \in R^{n_2}_+ \\
&\text{such that } Wy = h^\ell - \chi, \\
&\text{and } \psi(\chi, \xi^\ell) = qy \text{ is minimized.}
\end{aligned}$$

Of course in order to do so, we can take advantage of the techniques described
in the next section.

As suggested by Nazareth [14], Step 2 should not be carried out to optimality,
by which one means: find $\chi^{\nu+1}$ that minimizes $\Psi(\chi) + \pi^\nu \chi$. All that
is really necessary is to find a tender that satisfies the condition (3.29) given in
Step 2. Nazareth points out that if this strategy is followed, the complete set of
calls to Step 2 will be of similar computational effort as that of solving problem
(3.1), whereas carrying out Step 2 to optimality would require at each iteration
essentially the same amount of work as solving (3.1). In fact Nazareth [14]
suggests that Step 2 should be done with a nonsmooth optimizer (using the
bunching techniques to be discussed in Section 3.4). This is also the direction
of the algorithmic research recently undertaken by Kiwiel [12].


3.4 Sifting, Bunching and Bases Updates


In the final analysis, Step 3 of the L-shaped algorithm boils down to the
calculation of the value of $\mathcal{Q}$ and of its gradient at $x^\nu$. What it involves is solving
a large number of similar linear programs, or if you prefer one linear program
with matrix structure as in Figure 3.6. The same type of operations would be
required for the actual carrying out of Step 2 of the algorithm based on the
generation of tenders. The extent to which we are able to speed up these
computations will determine the level of “stochasticity” that we are able to handle.
This Section raises the question of how to organize the work so as to minimize
the computational effort involved. We consider only the case of multiple
right-hand sides, resulting, as the case may be, from $h$ and/or $T$ random; by
duality, the analysis also applies to the case when only $q$ is random (and $h$ and
$T$ are nonstochastic). When both the cost coefficients and the right-hand sides
of the recourse problem (3.3) include random variables a further refinement
of the methods suggested here would be required. We shall not be concerned
with special cases such as simple recourse $W = (I, -I)$, or network-structured
problems when specific computational shortcuts are possible, e.g. Midler and
Wollmer [13], Wallace [23], and Qi [17].

In its simplest form, the problem that we are concerned with is finding an
efficient procedure for solving $L$ linear programs with variable right-hand sides:
for $\ell = 1, \dots, L$,

$$\begin{aligned}
&\text{find } y \in R^{n_2}_+ \\
&\text{such that } Wy = t^\ell, \\
&\text{and } qy = w^\ell \text{ is minimized.}
\end{aligned} \tag{3.30}$$

The cost coefficients are constant; we simply write $q$ for $q^1 = q^2 = \dots = q^L$. In
terms of Step 3 of the L-shaped algorithm, the vectors $\tau = \{t^\ell, \ell = 1, \dots, L\}$
come from $t^\ell = h^\ell - T^\ell x^\nu$ for some fixed $x^\nu$.
For all $\ell$, (3.30) is feasible, i.e.

$$t^\ell \in \operatorname{pos} W = \{t \mid t = Wy,\ y \ge 0\}, \tag{3.31}$$

(this comes from the fact that $x^\nu$ or $\chi^\nu$ satisfies the induced feasibility
constraints). Moreover, by assumption we have that (3.30) is bounded, and hence
for all $\ell$, (3.30) is solvable. We shall denote the optimal solution by $y^\ell$, and the
associated simplex multipliers by $\pi^\ell$. We have that

$$\pi^\ell W \le q$$

and

$$q y^\ell = \pi^\ell t^\ell.$$

The methods that we study can be divided into sifting (discrete parametric 
analysis) and bunching (basis by basis analysis) procedures. We begin with 
a description of a very crude bunching procedure, which nonetheless would 
be much more efficient than solving separately all L linear programs (3.30). 
This technique is easily modified to also take care of the case of random cost 
coefficients, cf. Wets [29], p.587. 

Let $B$ be an $m_2 \times m_2$ invertible submatrix of $W$ with $\gamma B^{-1} W \le q$ where $\gamma$
is the subvector of $q$ that corresponds to the columns of $W$ in $B$; recall that $W$ is
assumed to be of full row rank. Then from the optimality conditions for linear
programming, it follows that this basis $B$ is optimal for any vector $t \in R^{m_2}$
such that

$$B^{-1} t \ge 0 \tag{3.32}$$

and then the optimal simplex multipliers are given by

$$\pi = \gamma B^{-1}.$$

This means that pos W is decomposable into a number of simplicial cones of
the type $\operatorname{pos} B = \{t \mid B^{-1} t \ge 0\}$, such that whenever $t \in \operatorname{pos} B$ then $B$ is an
optimal basis for the linear program: find $y \in R^{n_2}_+$ such that $Wy = t$ and
$w = qy$ is minimized. Moreover, on pos B, the (optimal) simplex multipliers
remain constant. All of these observations can be rendered very precise and are
summarized in the Basis Decomposition Theorem, Walkup and Wets [22]. The
figure below illustrates such a decomposition.

Now suppose that we solve the linear program (3.30) for some $t$, and $B_{(1)}$
is the corresponding optimal basis. Since $B_{(1)}^{-1}$ is readily available, finding the
bunch of vectors $t^\ell$ for which $B_{(1)}$ is the optimal basis is relatively easy since
all we need to do is to verify if

$$B_{(1)}^{-1} t^\ell \ge 0. \tag{3.33}$$

Let $\mathcal{B}_1$ be the family of all such vectors, $\pi_{(1)}$ be the corresponding simplex
multipliers and the probability mass associated with $\mathcal{B}_1$ given by

$$p_{(1)} = \sum_{t^\ell \in \mathcal{B}_1} p_\ell.$$

Figure 3.7 Decomposition of pos W.

All vectors $t^\ell$ that have failed the nonnegativity test (3.33) are in $\tau_1 = \tau \setminus \mathcal{B}_1$.
We are now in the same situation as at the outset. Picking a vector in $\tau_1$,
we obtain a new basis $B_{(2)}$, the corresponding vector $\pi_{(2)}$, the bunch $\mathcal{B}_2$ and
associated probability mass $p_{(2)}$. This process is continued until all $t^\ell \in \tau$ have
been bunched. The expected value of these linear programs (the quantity that
would correspond to $\mathcal{Q}(x^\nu)$ or $\Psi(\chi^\nu)$) is given by:

$$\sum_k \pi_{(k)} \sum_{t^\ell \in \mathcal{B}_k} p_\ell\, t^\ell.$$

The expected simplex multiplier (a quantity used in the construction of
feasibility and optimality cuts) is given by:

$$\sum_k p_{(k)}\, \pi_{(k)}.$$
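
The crude bunching procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, exact rational arithmetic replaces floating point, and a brute-force enumeration of dual feasible bases stands in for the simplex solve of (3.30), which is only sensible for very small $W$.

```python
from fractions import Fraction
from itertools import combinations


def solve(A, b):
    """Solve the square system A z = b exactly; return None if A is singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return None
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


def dual_feasible_bases(W, q):
    """Enumerate bases B of W whose multipliers pi = gamma B^{-1} satisfy pi W <= q."""
    m, n = len(W), len(W[0])
    out = []
    for cols in combinations(range(n), m):
        Bt = [[W[i][j] for i in range(m)] for j in cols]      # rows of B^T
        pi = solve(Bt, [q[j] for j in cols])                  # pi B = gamma
        if pi is not None and all(
            sum(pi[i] * W[i][j] for i in range(m)) <= q[j] for j in range(n)
        ):
            out.append((cols, pi))
    return out


def crude_bunching(W, q, ts, ps):
    """Bunch the right-hand sides ts (probabilities ps) by optimal basis.

    Returns the bunches, the expected optimal value and the expected multiplier.
    Assumes every t lies in pos W and every program is bounded.
    """
    m = len(W)
    bases = dual_feasible_bases(W, q)

    def in_cell(cols, t):                                     # test (3.33)
        y = solve([[W[i][j] for j in cols] for i in range(m)], t)
        return all(v >= 0 for v in y)

    remaining = list(range(len(ts)))
    bunches, value, mult = [], Fraction(0), [Fraction(0)] * m
    while remaining:
        t0 = ts[remaining[0]]
        # Stand-in for "solve (3.30) for t0": pick a dual feasible basis containing t0.
        cols, pi = next((c, p) for c, p in bases if in_cell(c, t0))
        members = [l for l in remaining if in_cell(cols, ts[l])]
        mass = sum(ps[l] for l in members)
        value += sum(ps[l] * sum(pi[i] * ts[l][i] for i in range(m)) for l in members)
        mult = [mult[i] + mass * pi[i] for i in range(m)]
        bunches.append((cols, members))
        remaining = [l for l in remaining if l not in members]
    return bunches, value, mult


# Hypothetical simple-recourse example: W = (I, -I), unit costs, four scenarios.
W = [[1, 0, -1, 0],
     [0, 1, 0, -1]]
q = [1, 1, 1, 1]
ts = [(1, 2), (3, 1), (-1, 2), (-2, -2)]
ps = [Fraction(1, 4)] * 4
bunches, value, mult = crude_bunching(W, q, ts, ps)
print(bunches, value, mult)
```

In this example the cells of the decomposition are the four orthants, so the expected value is simply the expected $\ell_1$-norm of $t$; only three bases are ever used for the four right-hand sides.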


A number of computational shortcuts come immediately to mind as suggested
by the decomposition of pos W. First, note that $\tau$, or even co $\tau$ (the convex
hull of $\tau$), is a subset of pos W that meets some, and usually only a few, of
the simplicial cones that are part of this decomposition. Moreover, most of the
vectors in $\tau$ will be found in adjacent cells, thus instead of just picking any
vector $t$ that failed the (nonnegativity) test (3.33), we could choose a vector
$t$ in $\tau_1$ that belongs to a neighboring cell, which necessarily means
that $B_{(1)}^{-1} t$ has exactly 1 negative entry; note that $B_{(1)}^{-1} t$ having exactly one
negative entry does not automatically imply that $t$ belongs to an adjacent cell
of pos $B_{(1)}$. Passing from pos $B_{(1)}$ to a neighboring cell requires just one (dual)
pivot step. It is clear that substantial computational savings could be realized
by a systematic organization of the work.

One way is to proceed as suggested in Wets [29]: pick a vector $t \in \tau$, say
$t^1$, and solve the linear program (3.30) with $\ell = 1$. Let $B_{(1)}$ be the optimal
basis. Multiply each vector $t$ in $\tau$ by $B_{(1)}^{-1}$. The bunch $\mathcal{B}_1$ is the collection of
all vectors $t^\ell$ such that

$$\bar t^{\,\ell} = B_{(1)}^{-1} t^\ell \ge 0.$$

For each vector $t^\ell \in \tau_1 = \tau \setminus \mathcal{B}_1$, whose image $\bar t^{\,\ell}$ necessarily has at least 1 negative element,
we record the actual number of negative entries as well as $m_\ell$, the magnitude of
the most negative element. Now choose a vector $t$ in $\tau_1$ with a minimal number
of negative entries and among them one with $m_\ell$ as small as possible. Pivot,
relying on the criteria provided by the dual simplex method, to obtain the next
(optimal) basis $B_{(2)}$, the associated multipliers $\pi_{(2)}$ and construct $\tau_2$; and then
continue in a similar manner.

What all of this comes down to is that we build a partitioning of that
portion of pos W that covers $\tau$ (or co $\tau$). What we need is the sublattice
structure of the cells that contain $\tau$. In certain cases it may be possible to work
out the complete decomposition of pos W and then use it whenever we enter
Step 3 of the L-shaped algorithm. Each subbasis of $W$ that generates a cell of
the decomposition is recorded with labels that point to the neighboring cells.
The lattice generated by the decomposition in Figure 3.7 would take the graph
structure given in Figure 3.8. The labeling of the nodes could be the indices of
the columns in the basis.

Figure 3.8 Lattice of the decomposition of pos W.

The pointers would correspond to the pivot step required to pass from
one basis to a neighboring one. Here this is a planar graph but that would
not necessarily be the case if $m_2 > 3$. In general, working out the complete
decomposition of pos W may be a serious undertaking; the number of cells could
increase exponentially as a function of $m_2$ (for $n_2$ sufficiently large). Even for
problems whose recourse matrix $W$ has a network structure, the number of
components in a complete decomposition of pos W may become unmanageable
even for relatively “small” problems, see Wallace [23].

Short of first working out a complete decomposition and then finding a
good path through the lattice, so as to minimize the number of operations,
what could be done? What appears the most efficient approach to date is to
bunch the elements of $\tau$ by a trickling down procedure that we describe next.
Unless there are some good reasons for proceeding otherwise (for example the
inverse of a “good” subbasis of $W$ is available) we would start by finding the
cell associated with $\bar t$, where

$$\bar t = \sum_{t^\ell \in \tau} p_\ell\, t^\ell$$

is the mean of the vectors in $\tau$, geometrically: the centroid of $\tau$. We have to
solve the linear program:

$$\begin{aligned}
&\text{find } y \in R^{n_2}_+ \\
&\text{such that } Wy = \bar t, \\
&\text{and } qy \text{ is minimized.}
\end{aligned}$$


This yields an optimal basis $B_{(1)}$, its inverse $B_{(1)}^{-1}$ and associated multiplier
$\pi_{(1)}$. We assume that $B_{(1)}^{-1}$ is stored as an explicit dense matrix. Now consider
$t^1$ and sequentially perform the multiplications

$$[B_{(1)}^{-1}]_i\, t^1 = \bar t_i^{\,1}.$$

If $\bar t_i^{\,1} \ge 0$ for all $i$, place $t^1$ in bunch 1, otherwise stop as soon as for some index
$i$, $\bar t_i^{\,1} < 0$. Perform one dual simplex step, with pivot in row $i$. In doing so we
create a new basis $B_{(2)}$ with

$$[B_{(2)}^{-1}]_i\, t^1 \ge 0$$

(preserving dual feasibility). The branching from $B_{(1)}$ occurred on $i$. Repeat
the same procedure with $B_{(2)}$ instead of $B_{(1)}$, branching if necessary (recording
the branching index), otherwise assigning $t^1$ to bunch 2. If branching did
occur, then continue until a basis $B_{(k)}$ is found such that $B_{(k)}^{-1} t^1 \ge 0$. This
will necessarily take place since $t^1 \in \tau \subset \operatorname{pos} W$ by assumption, and the pivot
path is a simplex path for the dual problem with the pivot choice determined
by the first negative entry; degeneracy could be resolved by a random selection
rule or Bland's rule. This procedure creates a tree, rooted at $B_{(1)}$, whose
nodes correspond to the bases (associated with the cells of the decomposition
of pos W), the branches being determined by the first negative entry encountered
when multiplying $t$ by $B_{(k)}^{-1}$. Figure 3.9 gives part of such a tree for the
decomposition of Figure 3.7 assuming that $\tau$ covers pos W, and that

$$\bar t \in \operatorname{pos}(W^6, W^9, W^{10}).$$

The numbers on the branches indicate branching on the $i$-th entry that leads
to the subsequent basis.

Figure 3.9 Tree generated by trickling down procedure (nodes recoverable from the drawing: (6, 9, 10), (7, 9, 10), (7, 9, 5), (4, 9, 5)).


Note that the same cell may be discovered on different branches of the tree. 
No effort would be made to recognize that this is taking place, since too much 
computational effort would be involved in trying to identify such a situation, 
and only marginal gains could be reaped as will be clear from the subsequent 
development that concerns updates, i.e. the information necessary to pass from 
one node of the tree to the next. 
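
A compact sketch of the trickling down procedure may help fix ideas. It is written under several assumptions that are not in the text: the names `dual_pivot` and `trickle_down` are hypothetical, bases are refactored from scratch with exact arithmetic rather than updated, the tree is a dictionary keyed by (basis, branching index), and no anti-cycling rule is included, so degenerate data could in principle loop.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A z = b exactly; return None if A is singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return None
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


def basis_matrix(W, basis):
    return [[W[i][j] for j in basis] for i in range(len(W))]


def dual_pivot(W, q, basis, r):
    """One dual simplex step: the basic variable in row r leaves the basis."""
    m, n = len(W), len(W[0])
    B = basis_matrix(W, basis)
    Bt = [list(col) for col in zip(*B)]
    pi = solve(Bt, [q[j] for j in basis])                    # pi B = gamma
    best = None
    for j in range(n):
        if j in basis:
            continue
        alpha = solve(B, [W[i][j] for i in range(m)])[r]     # entry r of B^{-1} W_j
        if alpha < 0:
            d = q[j] - sum(pi[i] * W[i][j] for i in range(m))  # reduced cost, >= 0
            ratio = d / -alpha
            if best is None or ratio < best[0]:
                best = (ratio, j)
    if best is None:
        raise ValueError("right-hand side is not in pos W")
    new_basis = list(basis)
    new_basis[r] = best[1]
    return tuple(new_basis)


def trickle_down(W, q, root, t, tree):
    """Follow (and extend) the tree from the root until B^{-1} t >= 0."""
    basis = tuple(root)
    while True:
        tbar = solve(basis_matrix(W, basis), t)
        r = next((i for i, v in enumerate(tbar) if v < 0), None)  # first negative entry
        if r is None:
            return basis
        if (basis, r) not in tree:                           # new branch: one dual pivot
            tree[(basis, r)] = dual_pivot(W, q, basis, r)
        basis = tree[(basis, r)]


# Hypothetical example: W = (I, -I), unit costs; (0, 1) is optimal for the mean.
W = [[1, 0, -1, 0],
     [0, 1, 0, -1]]
q = [1, 1, 1, 1]
tree = {}
cells = [trickle_down(W, q, (0, 1), t, tree)
         for t in [(1, 2), (3, 1), (-1, 2), (-2, -2)]]
print(cells)
print(tree)
```

The last right-hand side reuses the branch created for the third one before adding a second pivot, which is the saving the tree is meant to provide. In a serious implementation each node would carry only the update information discussed next, not a refactored basis.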
It is clear that a great amount of calculation is bypassed by the trickling
down procedure, by comparison to the “rough” version of the bunching procedure
described at the beginning of this Section. However, it may appear that
the storage of all inverse bases (corresponding to the nodes of the tree) as well
as keeping track of pointers may negate all the advantages that may be gained
from this bunching technique. This, however, can be overcome by relying on
Schur-complement updates for the bases $B_{(k)}$. Updates of this type in the context
of linear programming were first suggested by Bisschop and Meeraus [3],
[4]. Suppose $B_{(k)}$ is obtained from $B_{(0)}$ by adding $k$ columns (without loss
of generality assume they are $W_{(k)} = [w_1, \dots, w_k]$) and by pivoting out $k$
columns. The equation

$$B_{(k)}\, y' = t$$

where $y' \in R^{m_2}$ can also be rewritten as

$$\begin{pmatrix} B_{(0)} & W_{(k)} \\ I_{(k)} & 0 \end{pmatrix}
\begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} t \\ 0 \end{pmatrix}$$

where $I_{(k)}$ is part of an identity matrix with rows having their entry 1
corresponding to the columns that have to leave the basis when passing from $B_{(0)}$
to $B_{(k)}$. This matrix of coefficients can be written as a block LU product

$$\begin{pmatrix} B_{(0)} & W_{(k)} \\ I_{(k)} & 0 \end{pmatrix} =
\begin{pmatrix} B_{(0)} & 0 \\ I_{(k)} & C_{(k)} \end{pmatrix}
\begin{pmatrix} I & Y_{(k)} \\ 0 & I \end{pmatrix}$$

where the $I$'s in the last matrix are $m_2 \times m_2$ and $k \times k$ identity matrices. We
have that

$$Y_{(k)} = B_{(0)}^{-1} W_{(k)}, \qquad C_{(k)} = -I_{(k)} Y_{(k)},$$

and thus

$$C_{(k)} = -I_{(k)} B_{(0)}^{-1} W_{(k)}.$$

This matrix is $k \times k$ and is the only information that is needed to reconstruct
all that is needed at the node associated with $B_{(k)}$, in addition to $B_{(0)}$ which
is supposed to be available (in an LU form, for example). This means that at
depth 1 in the tree, only $1 \times 1$ updates are necessary; at depth 2, $2 \times 2$ updates.
Since we reasonably expect to find the largest number of points of $\tau$ in the
immediate neighborhood of $\bar t$ we do not expect to have to construct very long
(deep) trees, and the updating information should be of manageable size.
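
The block factorization above translates directly into a solve that touches only $B_{(0)}$ and the small matrix $C_{(k)}$. The sketch below is illustrative only: the function name `schur_solve`, the argument layout and the data are assumptions, and a dense exact solver stands in for a stored LU factorization of $B_{(0)}$.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A z = b exactly; return None if A is singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return None
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


def schur_solve(B0, Wk, leaving, t):
    """Solve B_(k) y' = t using solves with B_(0) and the k x k matrix C_(k).

    B0      : m x m root basis, as a list of rows
    Wk      : m x k matrix of entering columns, as a list of rows
    leaving : positions (column indices of B0) that leave the basis
    Returns (y, z): y is zero in the leaving positions and carries the values of
    the columns of B0 that stay; z carries the values of the entering columns.
    """
    m, k = len(B0), len(leaving)
    u = solve(B0, t)                                                    # B0^{-1} t
    Y = [solve(B0, [Wk[i][c] for i in range(m)]) for c in range(k)]     # columns of Y_(k)
    C = [[-Y[c][leaving[r]] for c in range(k)] for r in range(k)]       # C_(k) = -I_(k) Y_(k)
    z = solve(C, [-u[p] for p in leaving])                              # C z = -I_(k) B0^{-1} t
    y = [u[i] - sum(Y[c][i] * z[c] for c in range(k)) for i in range(m)]
    return y, z


# Hypothetical data: the first column of B0 is replaced by the column (1, 2).
B0 = [[1, 1],
      [0, 1]]
Wk = [[1],
      [2]]
y, z = schur_solve(B0, Wk, leaving=[0], t=[3, 4])
direct = solve([[1, 1], [2, 1]], [3, 4])       # new basis: columns (1, 2) and (1, 1)
print(y, z, direct)
```

The direct solve with the explicitly formed new basis is there only as a cross-check; the point of the update is that it is never needed.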

Bunching by the trickling down procedure appears to minimize the amount
of operations needed to assign a given $t \in \tau$ to its bunch, and by relying on
Schur-complement updates the amount of information required at each node is
kept very low. When $k$, the number of bunches, gets to be too large it may
be necessary to start a tree with a new root. This approach to bunching can
even be used effectively in specially structured problems such as worked out in
Wallace [24], in the case of networks.

The sifting procedure, a sort of discrete parametric analysis, has been
proposed by Garstka and Rutenberg [9]. It is designed for handling the case
when the points in $\tau$ are the possible realizations of $m_2$ independent random
variables, for example when $T$ is nonstochastic and the $h_i(\cdot)$, $i = 1, \dots, m_2$, are
independent random variables. We assume that the vectors in $\tau \subset \operatorname{pos} W$ are
obtained by setting for every $i = 1, \dots, m_2$,

$$t_i = \tau_{i\ell}$$

for some $\ell \in \{1, \dots, k_i\}$ where we have ordered the $\tau_{i\ell}$, i.e.,

$$\tau_{i1} \le \tau_{i2} \le \dots \le \tau_{i,k_i}.$$

We have thus a doubly indexed array:

$$\begin{array}{l}
\tau_{11} \le \tau_{12} \le \dots \le \tau_{1,k_1} \\
\tau_{21} \le \tau_{22} \le \dots \le \tau_{2,k_2} \\
\qquad \vdots \\
\tau_{m_2,1} \le \tau_{m_2,2} \le \dots \le \tau_{m_2,k_{m_2}}
\end{array} \tag{3.34}$$
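We sift through this array in the following manner: let

$$t^1 = (\tau_{11}, \tau_{21}, \dots, \tau_{m_2,1}),$$

and solve the linear program

$$\begin{aligned}
&\text{find } y \in R^{n_2}_+ \\
&\text{such that } Wy = t^1, \\
&\text{and } qy \text{ is minimized.}
\end{aligned}$$

Suppose that $B_{(1)}$ is the associated optimal basis, with

$$B_{(1)}^{-1} = [\beta^1, \beta^2, \dots, \beta^{m_2}].$$

Recall that $t \in \operatorname{pos} B_{(1)}$ as long as $B_{(1)}^{-1} t \ge 0$. Hence to find out which subset
of vectors belong to pos $B_{(1)}$, for $\ell = m_2, \dots, 1$ we study systematically the
range of values of $\tau$ that satisfy:

$$\Big( \sum_{j \ne \ell} \beta^j \tau_{j,l_j} \Big) + \beta^\ell \tau \ge 0$$

for some fixed $\tau_{j,l_j} \in \{\tau_{j,1}, \dots, \tau_{j,k_j}\}$ and record those values of $\tau_{\ell,\cdot}$ that belong
to that range; the corresponding t-vectors are then in pos $B_{(1)}$. More specifically,
identify first the largest index $k$ such that

$$\Big( \sum_{j=1}^{m_2-1} \beta^j \tau_{j,1} \Big) + \beta^{m_2} \tau_{m_2,k} \ge 0.$$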
All vectors $(\tau_{11}, \tau_{21}, \dots, \tau_{m_2-1,1}, \tau_{m_2,\ell})$ with $\ell = 1, \dots, k$ are recorded as being
in pos $B_{(1)}$. We then “move” $\tau_{m_2-1,1}$ to $\tau_{m_2-1,2}$ and repeat the same analysis
on the last coordinate of $t$. If

$$\Big\{ \tau \ \Big|\ \Big( \sum_{j=1}^{m_2-2} \beta^j \tau_{j,1} \Big) + \beta^{m_2-1} \tau_{m_2-1,2} + \beta^{m_2} \tau \ge 0 \Big\}
\cap [\tau_{m_2,1}, \tau_{m_2,k_{m_2}}] = \emptyset,$$

we return the $(m_2 - 1)$-th coordinate of $t$ to $\tau_{m_2-1,1}$ and increase the preceding
element of $t$ to its next higher value, otherwise it is the $(m_2 - 1)$-th coordinate
which is increased (discretely) to its next higher value, if possible; if not it
is again the $(m_2 - 2)$-th coordinate which is pushed to its next value. This
is continued, systematically, until the search with $B_{(1)}$ is exhausted. We now
restart the procedure with the “lowest” vector

$$(\tau_{1,l_1}, \tau_{2,l_2}, \dots, \tau_{m_2,l_{m_2}})$$

which has not been included in the first bunch, i.e. for every $i = 1, \dots, m_2$,
the index $l_i$ is as small as possible. The procedure is repeated until all possible
vectors generated by the array have been assigned to a given bunch. Further
details can be found in Garstka and Rutenberg [9], who also report computational
experience which would favor this approach with respect to the coarse bunching
procedure described at the beginning of this section. However, to rely on this
procedure we must be in this specific situation, i.e. when the vectors in $\tau$ can
be given the array representation (3.34), and this is not always the case: we
often deal with dependent random variables, and if (3.1) is the result of an
approximation scheme then the chosen discretization will usually not be of this
type.
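
The elementary step of the sifting procedure, finding the range of one coordinate for which the current basis stays optimal while the other coordinates are held fixed, can be sketched as follows. The function name `admissible_values` and the data are hypothetical, and the sketch covers only this one step, not the full systematic sweep through the array (3.34).

```python
from fractions import Fraction


def admissible_values(Binv, t_fixed, ell, grid):
    """Grid values tau for coordinate ell such that Binv t >= 0, other coordinates fixed.

    Row i of the test reads  base_i + Binv[i][ell] * tau >= 0,  so the admissible
    tau form an interval [lo, hi]; the entry t_fixed[ell] itself is ignored.
    """
    m = len(Binv)
    lo, hi = None, None                          # None means unbounded on that side
    for i in range(m):
        base = sum(Binv[i][j] * t_fixed[j] for j in range(m) if j != ell)
        coef = Binv[i][ell]
        if coef == 0:
            if base < 0:
                return []                        # row i fails whatever tau is
        elif coef > 0:
            bound = Fraction(-base) / coef       # tau >= bound
            lo = bound if lo is None else max(lo, bound)
        else:
            bound = Fraction(-base) / coef       # tau <= bound
            hi = bound if hi is None else min(hi, bound)
    return [tau for tau in grid
            if (lo is None or tau >= lo) and (hi is None or tau <= hi)]


# Hypothetical data: inverse of the basis B = [[1, 1], [0, 1]] and a 2-row array
# with tau_1 in {1, 2, 4} and tau_2 in {0, 1, 3}.
Binv = [[1, -1],
        [0, 1]]
row2 = [0, 1, 3]
first = admissible_values(Binv, [1, 0], 1, row2)    # t_1 held at its lowest value 1
moved = admissible_values(Binv, [4, 0], 1, row2)    # t_1 moved up to 4
print(first, moved)
```

Because the admissible set is an interval and the lowest vector is in the cell by construction, it is enough in the procedure to record the largest admissible index, as the text describes.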


3.5 Conclusion


At this stage of algorithmic development for (linear) stochastic programs with
recourse, decomposition-type methods aided by a number of shortcuts made
possible by the structural properties of the problem appear as the clear-cut
favorites. Of course, this is mostly due to the fact that they allow us to exploit
to the fullest these structural properties, see Section 3.4, but there may be
some other justification for using decomposition-type methods. Experiments,
cf. Beer [1], have shown that with the decomposition method, a value near
the optimum (Beer speaks of an error of no more than 3%) is reached at an
early stage of the computation. Given on one hand the stability of the solution
to stochastic programs (see Dupačová [6], Wang [25]) and on the other hand
our limitations in the (precise) description of stochastic phenomena or other
sources of uncertainties, as mentioned in Section 3.1, a rapid convergence to an
approximate solution is all that is expected and required. If solving the discrete
stochastic program (3.1) is part of a sequential scheme for solving a stochastic
program with continuous probability distribution or with a discrete distribution
involving many more points than $L$, then it would not be necessary to solve up
to optimality before a further refinement is introduced. Again decomposition-type
methods that exhibit rapid convergence to nearly optimal solutions would
be ideally suited in such a scheme.
CHAPTER 4


NONLINEAR PROGRAMMING TECHNIQUES APPLIED 
TO STOCHASTIC PROGRAMS WITH RECOURSE 


L. Nazareth and R.J-B Wets 


Abstract 


Stochastic convex programs with recourse can equivalently be formulated as 
nonlinear convex programming problems. These possess some rather marked 
characteristics. Firstly, the proportion of linear to nonlinear variables is often 
large and leads to a natural partition of the constraints and objective. Secondly, 
the objective function corresponding to the nonlinear variables can vary over a 
wide range of possibilities; under appropriate assumptions about the underlying 
stochastic program it could be, for example, a smooth function, a separable 
polyhedral function or a nonsmooth function whose values and gradients are 
very expensive to compute. Thirdly, the problems are often large-scale and 
linearly constrained with special structure in the constraints. 

This paper is a comprehensive study of solution methods for stochastic pro- 
grams with recourse viewed from the above standpoint. We describe a number 
of promising algorithmic approaches that are derived from methods of non- 
linear programming. The discussion is a fairly general one, but the solution 
of two classes of stochastic programs with recourse are of particular interest. 
The first corresponds to stochastic linear programs with simple recourse and 
stochastic right-hand-side elements with given discrete probability distribution. 
The second corresponds to stochastic linear programs with complete recourse 
and stochastic right-hand-side vectors defined by a limited number of scenarios, 
each with given probability. A repeated theme is the use of the MINOS code 
of Murtagh and Saunders as a basis for developing suitable implementations. 
4.1 Introduction

We consider stochastic linear programs of the type

$$\begin{aligned}
&\text{find } x \in R^{n_1} \\
&\text{such that } Ax = b,\ x \ge 0 \\
&\text{and } z = E_\omega[c(\omega)x + Q(x,\omega)] \text{ is minimized}
\end{aligned} \tag{4.1}$$

where $Q$ is calculated by finding, for given decision $x$ and event $\omega$, an optimal
recourse $y \in R^{n_2}$, viz.

$$Q(x, \omega) = \inf_{y \in C} \{\, q(y, \omega) \mid Wy = h(\omega) - Tx \,\}. \tag{4.2}$$

Here $A$ ($m_1 \times n_1$), $T$ ($m_2 \times n_1$), $W$ ($m_2 \times n_2$) and $b$ ($m_1$) are given (fixed)
matrices, $c(\cdot)$ ($n_1$) and $h(\cdot)$ ($m_2$) are random vectors, $y \mapsto q(y, \cdot): R^{n_2} \to R$ is a
random finite-valued convex function and $C$ is a convex polyhedral subset of
$R^{n_2}$, usually $C = R^{n_2}_+$. $E$ denotes expectation.

With $c = E_\omega[c(\omega)]$, an equivalent form to (4.1) is

$$\begin{aligned}
&\text{minimize } cx + \mathcal{Q}(x) \\
&\text{subject to } Ax = b \\
&\qquad\qquad\ \ x \ge 0
\end{aligned} \tag{4.3}$$

where $\mathcal{Q}(x) = E_\omega[Q(x, \omega)]$. Usually $q(y, \omega)$ will also be a linear nonstochastic
function $qy$. (For convenience, we shall, throughout this paper, write $cx$ and $qy$
instead of $c^T x$ and $q^T y$.)

Two instances of the above problem are of particular interest: 


(C1) Problems with simple recourse, i.e. with $W = [I, -I]$, stochastic right-
hand-side elements with given discrete probability distribution and penalty
vectors $q^+$ and $q^-$ associated with shortage and surplus in the recourse
stage (4.2).

(C2) Problems with complete recourse and stochastic right-hand-side vectors 
defined by a limited number of scenarios, each with given probability. 


Henceforth, for convenience, we shall refer to these as C1 and C2 prob- 
lems respectively. They can be regarded as a natural extension of linear and 
nonlinear programming models into the domain of stochastic programming. 
More general stochastic programs with recourse can sometimes be solved by an 
iterative procedure involving definition (for example, using approximation or 
sampling) of a sequence of C1 or C2 problems.

Within each of several categories of nonlinear programming methods, we
summarize briefly the main underlying approach for smooth problems, give
where appropriate extensions to solve nonsmooth problems and then discuss
how these lead to methods for solving stochastic programs with recourse. Thus,
in each case, we begin with a rather broadly based statement of the solution
strategy, and then narrow down the discussion to focus on methods and
computational considerations for stochastic programs with recourse, where the special
structure of the problem is now always in the background. (During the course
of the discussion we occasionally consider other related formulations, in
particular the model with probabilistic constraints. However it is our intention to
concentrate upon the recourse model. We do not discuss questions concerning
approximation of distribution functions, except very briefly at one or two points
in the text.) This paper is not intended to provide a complete survey. Rather,
our aim is to establish some framework of discussion within the theme set by the
title of this paper and within it to concentrate on a number of promising lines of
algorithmic development. We try to strike a balance between the specific (what
is practicable using current techniques, in particular, for C1 and C2 problems)
and the speculative (what should be possible by extending current techniques).
An important theme will be the use of MINOS (the Mathematical Programming
System of Murtagh and Saunders [49], [50]) as a basis for implementation.
Finally we seek to set the stage for the description of an optimization system
based upon MINOS for solving C1 problems, see Nazareth [55].

We shall assume that the reader is acquainted with the main families of 
optimization methods, in particular, 

(a) univariate minimization, 

(b) Newton, quasi-Newton and Lagrangian methods for nonlinear minimiza- 
tion, 

(c) subgradient (nonmonotonic) minimization of nonsmooth functions, possi-
bly using space dilation (variable metric), and the main descent methods
of nonsmooth minimization,

(d) stochastic quasi-gradient methods, 

(e) the simplex method of linear programming and its reduced-gradient exten- 
sions. 


Good references for background material are Fletcher [20], Gill et al. [23], 
Bertsekas [4], Lemarechal [42], Shor [66], Ermoliev [16], Dantzig [11], Murtagh 
& Saunders [49]. 

We shall concentrate upon methods of nonlinear programming which seem 
to us to be of particular relevance to stochastic programming with recourse and 
discuss them under the following main headings: 


1. Problem Redefinition 
2. Linearization Methods 
3. Variable Reduction (Partitioning) Methods 
4. Lagrange Multiplier Methods

A nonlinear programming algorithm will often draw upon more than one of
these groups and there is, in fact, significant overlap between them. However,
for purposes of discussion, the above categorization is useful.

4.2 Problem Redefinition

By problem redefinition we mean a restructuring of a nonlinear programming 
problem to obtain a new problem which is then addressed in place of the original 
one. This redefinition may be achieved by introducing new variables, exploiting 
separability, dualizing the original problem and so on. For example, consider 
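the minimization of a polyhedral function given by

$$\min_{x \in R^n} \; \max_{j=1,\dots,m} \left[ (a^j)^T x + \beta^j \right] \tag{4.4a}$$

This can be accomplished by transforming the problem into the linear program

$$\begin{array}{ll}
\text{minimize} & v \\
\text{subject to} & v \ge (a^j)^T x + \beta^j, \quad j = 1,\dots,m
\end{array} \tag{4.4b}$$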


which can then be solved by the simplex method. 
Problem redefinition often precedes the application of other solution meth- 
ods discussed in later sections of this paper. 
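
As a small illustration of the passage from (4.4a) to (4.4b), the following sketch minimizes a polyhedral function of a single variable. It enumerates the vertices of the epigraph linear program instead of calling a simplex code, which is workable only because that LP has just two variables $(x, v)$. The function name `minimize_max_affine_1d` and the data are hypothetical.

```python
from itertools import combinations

def minimize_max_affine_1d(pieces, tol=1e-9):
    """Minimize max_j (a_j * x + b_j) over a scalar x.

    Works on the epigraph LP 'min v s.t. v >= a_j * x + b_j'. In the two
    variables (x, v) every vertex is the intersection of two constraint
    lines, so we enumerate those intersections and keep the feasible one
    with the smallest v. Assumes the minimum is finite.
    """
    best = None
    for (a1, b1), (a2, b2) in combinations(pieces, 2):
        if abs(a1 - a2) < tol:
            continue                      # parallel lines give no vertex
        x = (b2 - b1) / (a1 - a2)
        v = a1 * x + b1
        # feasibility in the epigraph: v must dominate every affine piece
        if all(v >= a * x + b - tol for a, b in pieces):
            if best is None or v < best[1]:
                best = (x, v)
    return best

# hypothetical data: three affine pieces (slope, intercept)
pieces = [(-2.0, 1.0), (-0.5, 0.0), (1.0, -3.0)]
x_star, v_star = minimize_max_affine_1d(pieces)
print(x_star, v_star)
```

For problems with more than one variable the same epigraph LP would be handed to a simplex routine; the enumeration above is only meant to make the equivalence visible.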


4.2.1 Application to Recourse Problems 
The following two transformations of recourse problems will prove useful: 


(a) When the technology matrix is fixed, new variables $\chi$, termed tenders, can
be introduced into (4.3). This gives an equivalent form as follows:

$$\begin{array}{ll}
\text{minimize} & cx + \Psi(\chi) \\
\text{subject to} & Ax = b \\
& Tx - \chi = 0 \\
& x \ge 0.
\end{array} \tag{4.5}$$

(4.5) is useful because it is a nonlinear program in which the number of
variables occurring nonlinearly is $m_2$ instead of $n$ (and usually $m_2 \ll n$).
For a more detailed discussion of the use of tenders in algorithms for solving
stochastic linear programs with recourse, see Nazareth and Wets [56].


(b) Another useful transformation involves introducing second stage activities
into the first stage. It is shown in Nazareth [51] that an alternative
form equivalent to (4.5) is

$$\begin{array}{ll}
\text{minimize} & cx + qy + \Psi(\chi) \\
\text{subject to} & Ax = b \\
& Tx + Wy - \chi = 0 \\
& x \ge 0, \; y \ge 0.
\end{array} \tag{4.6}$$

This transformation also has significant advantages from a computational
standpoint, as we shall see below. These stem, in part, from the fact that
dual feasible variables, say $(\rho, \pi)$, satisfy $W^T \pi \le q$.
For C1 problems, $\Psi(\chi)$ in (4.5) is separable, i.e. $\Psi(\chi) = \sum_{i=1}^{m_2} \Psi_i(\chi_i)$.
In such problems, each component of $h(\cdot)$ is assumed to be discretely
distributed, say with $h_i(\cdot)$ given by levels $h_{i1},\dots,h_{ik_i}$ and associated
probabilities $p_{i1},\dots,p_{ik_i}$; also $q(y)$ in (4.2) is two-piece linear and can be
replaced by $q^+ y^+ + q^- y^-$, $y^+ \ge 0$, $y^- \ge 0$ in (4.6). This implies that each
$\Psi_i(\chi_i)$ is piecewise linear with slopes, say, $s_{i\ell}$, $\ell = 0,\dots,k_i$. By
introducing new bounded variables $z_{i\ell}$, $\ell = 0,\dots,k_i$, we can reexpress $\chi_i$ as

$$\chi_i = h_{i0} + \sum_{\ell=0}^{k_i} z_{i\ell}$$

where $h_{i0}$ is the $i$-th component of $h_0$, the base tender. Then (4.5) takes the
form:

$$\begin{array}{ll}
\text{minimize} & cx + \sum_{i=1}^{m_2} \sum_{\ell=0}^{k_i} s_{i\ell} z_{i\ell} \\
\text{subject to} & Ax = b \\
& T^i x - \sum_{\ell=0}^{k_i} z_{i\ell} = h_{i0}, \quad i = 1,\dots,m_2 \\
& x \ge 0, \; 0 \le z_{i\ell} \le d_{i\ell}, \quad \ell = 0,\dots,k_i
\end{array} \tag{4.7}$$

with $d_{i\ell} = h_{i,\ell+1} - h_{i\ell}$.

$T^i$ denotes the $i$-th row of $T$. Optionally we can use the transformation (4.6)
to introduce $W = [I, -I]$ into the first stage. Details of an algorithm based
upon (4.7) can be found in Wets [72] and an alternative simpler version of this
algorithm can be found in Nazareth & Wets [56]. The latter algorithm is
implemented in the optimization system described in [55], where further
discussion and computational considerations may be found.
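
To make the bounded-variable device behind (4.7) concrete, the sketch below treats a single component of a simple recourse problem with a discrete distribution. It computes the slopes of the piecewise-linear function $\Psi_i$, splits a tender into bounded pieces filled from left to right, and checks that the sum of slope times piece reproduces $\Psi_i$ up to the constant $\Psi_i(h_{i0})$. All names and data are hypothetical; in particular the bounds `h0` and `h_max` on the tender are assumptions made so that every interval has finite width.

```python
def expected_recourse(chi, levels, probs, q_plus, q_minus):
    """Psi_i(chi) = E[q+ (h - chi)^+ + q- (chi - h)^+] for a discrete h."""
    return sum(p * (q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0))
               for h, p in zip(levels, probs))

def slopes_and_widths(h0, levels, probs, q_plus, q_minus, h_max):
    """Slopes s_l and widths d_l on the intervals [h0, h_1], ..., [h_k, h_max]."""
    knots = [h0] + list(levels) + [h_max]
    widths = [knots[l + 1] - knots[l] for l in range(len(knots) - 1)]
    slopes, cum = [], 0.0
    for l in range(len(widths)):
        # on interval l the slope is -q+ + (q+ + q-) * P(h <= chi)
        slopes.append(-q_plus + (q_plus + q_minus) * cum)
        if l < len(probs):
            cum += probs[l]
    return slopes, widths

def fill_in_order(chi, h0, widths):
    """Split chi - h0 into bounded pieces z_l, filling intervals left to right."""
    remaining, z = chi - h0, []
    for d in widths:
        step = min(max(remaining, 0.0), d)
        z.append(step)
        remaining -= step
    return z

levels, probs = [1.0, 2.0, 4.0], [0.2, 0.5, 0.3]   # sorted levels, hypothetical
q_plus, q_minus, h0, h_max = 3.0, 1.0, 0.0, 6.0    # assumed costs and bounds
slopes, widths = slopes_and_widths(h0, levels, probs, q_plus, q_minus, h_max)

chi = 3.0
z = fill_in_order(chi, h0, widths)
via_slopes = expected_recourse(h0, levels, probs, q_plus, q_minus) + sum(
    s * zl for s, zl in zip(slopes, z))
direct = expected_recourse(chi, levels, probs, q_plus, q_minus)
print(slopes, z, via_slopes, direct)
```

Because the slopes are nondecreasing (convexity), a linear program that minimizes the sum of slope times piece uses the intervals in this same left-to-right order without being told to, which is why no ordering constraints appear in (4.7).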
4.2.2 Extensions


The device of introducing new bounded variables, which was used to obtain 
(4.7), can be applied to a wider class of recourse problems. The assumptions 
of discrete distribution of h(-) and of two (or more) piece linearity of recourse 
objective are not central, although one must still retain the assumptions of 
simple recourse and separable recourse objective. Suppose, for example, the
distribution function of $h_i(\cdot)$, which need not be continuous, is piecewise
linear with knots $h_{i1},\dots,h_{ik_i}$, and $q = (q^+, q^-)$. Then $\Psi_i(\chi_i)$ is piecewise
quadratic. In general, if the distribution is defined in terms of splines of
order $s$ at knots $h_{i1},\dots,h_{ik_i}$, and $q(y)$ is separable, say, $\sum_{i=1}^{m_2} q_i(y_i)$ with
each $q_i(y_i)$ convex, then $\Psi_i(\chi_i)$ can be shown to be convex and piecewise
smooth. Suppose it is given by pieces $\Psi_{i\ell}(\chi_i)$ over intervals $(h_{i\ell}, h_{i,\ell+1})$.
Then, analogously to (4.7), we can transform the problem (4.5) into the
structured and smooth nonlinear program

$$\begin{array}{ll}
\text{minimize} & cx + \sum_{i=1}^{m_2} \sum_{\ell=0}^{k_i} \left[ \Psi_{i\ell}(h_{i\ell} + z_{i\ell}) - \Psi_{i\ell}(h_{i\ell}) \right] \\
\text{subject to} & Ax = b \\
& T^i x - \sum_{\ell=0}^{k_i} z_{i\ell} = h_{i0}, \quad i = 1,\dots,m_2 \\
& x \ge 0, \; 0 \le z_{i\ell} \le d_{i\ell}, \quad \ell = 0,\dots,k_i
\end{array} \tag{4.8}$$

with $d_{i\ell} = h_{i,\ell+1} - h_{i\ell}$.

(Here again we could use the transformation (4.6) to introduce $W = [I, -I]$
into the first stage.) Note that (4.7) is a special case of (4.8). The optimal
solution of (4.8) has an important property which is easy to prove. This result 
makes the nonlinear program (4.8) very amenable to solution by MINOS-like 
techniques and it is given by the following proposition: 


Proposition. In the optimal solution of (4.8), say $(x^*, z^*_{i\ell})$, if for some $t$,
$z^*_{it} < d_{it}$, then $z^*_{i\ell} = d_{i\ell}$ for all $\ell < t$.


Outline of Proof: Regard each $\Psi_{i\ell}(\chi_i)$ as the limit of a piecewise linear
function, and then appeal to the standard argument used in the piecewise-linear
case.

The above proposition tells us that there are, at most, $m_2$ superbasic
variables (see Section 4.4 for terminology) in the optimal solution of (4.8). This
would be to the advantage of a routine like MINOS, which thrives on keeping
the number of superbasics low. These remarks will become clearer after looking
at Section 4.4. Note also that Wets [70] discusses a special case of (4.8) when
the $\Psi_{i\ell}(\chi_i)$ are piecewise quadratic. A well-structured code for solving (4.7),
which uses only the LP facilities of MINOS, would be capable of a natural
extension to solve nonlinear problems of the form (4.8). MINOS was really
designed to solve problems of this type.
The above approach remains limited in scope, because of the need to assume
that recourse is simple and that the recourse objective is separable. Therefore
we would not expect it to be useful for C2 problems.

The transformations given by (4.5) and (4.6) are very useful prior to the
application of other techniques discussed in the following sections of this
chapter. Let us consider some possibilities.

1. When $T$ is nonstochastic, use of the transformation (4.5) in the methods
described by Kall [30] or the L-shaped algorithm of Van Slyke and Wets
[68] (see also Birge [6]) would lead to fewer nonzero elements in the
representation of the associated large-scale linear programs.

2. When $T$ is not a fixed matrix, typically only a few columns (activities),
say $T_2(\omega)$, would be stochastic. Say these correspond to variables $\hat{x}$, with
$x = (\bar{x}, \hat{x})$. We could then introduce a redefinition of the problem in which
a tender is associated with the nonstochastic columns, say $T_1$ of $T$; then the
degree of nonlinearity of the equivalent deterministic nonlinear programming
problem would be $m_2$ + dimension($\hat{x}$) instead of $n$. For example,
for simple recourse with $q = (q^+, q^-)$ we would have

$$\psi(\chi, \hat{x}, \omega) = \min_{y^+, y^- \ge 0} \left[\, q^+ y^+ + q^- y^- \mid y^+ - y^- = h(\omega) - \chi - T_2(\omega)\hat{x} \,\right]$$

$$\Psi(\chi, \hat{x}) = E_\omega\left[ \psi(\chi, \hat{x}, \omega) \right]$$

Note that $\Psi(\chi, \hat{x})$ continues to be separable in $\chi$, i.e. $\Psi(\chi, \hat{x}) = \sum_{i=1}^{m_2} \Psi_i(\chi_i, \hat{x})$.
These observations and the further developments that they imply would be
useful in a practical implementation.

3. Another interesting example of the use of the transformations involving
tenders is given in Nazareth [52] where they are used in the solution of
deterministic staircase-structured linear programs.


4.3 Linearization Methods


A prominent feature of methods in this group is that they solve sequences of 
linear programs. One can distinguish single-point and multi-point linearization. 
In both approaches convexity of functions is normally assumed. 
4.3.1 Single-Point Linearization Methods

We discuss this case very briefly.

Consider the problem $\min_{x \in K} f(x)$, where $K$ is polyhedral and $f(x)$
is smooth. The approach consists of solving a sequence of problems of the form:

$$\min_{x \in \bar{K}} \; \nabla f(x_k)^T (x - x_k) \tag{4.9}$$

where $\bar{K}$ is the original polyhedral set $K$, possibly augmented by some
additional constraints. This leads to a variety of methods. When $\bar{K} = K$ we
obtain the Frank-Wolfe [21] method, in which the solution, say $x_k^*$, defines a
search direction $d_k = x_k^* - x_k$. The method has the virtue that the solution is
found in one step if the original problem is linear. If $K$ is augmented by the
constraints $\|x - x_k\|_\infty \le \delta$ for some small positive constant $\delta$ we obtain the
Griffith & Stewart [26] method of approximate programming (MAP); for minimax
applications see Madsen & Schjaer-Jacobsen [46] and for extensions to the domain
of general nonsmooth optimization see the monograph of Demyanov & Vasiliev
[13].
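
The Frank-Wolfe iteration built on (4.9) can be sketched for a case where the linear subproblem is trivial: over the unit simplex the minimizer of a linear function is the vertex with the smallest gradient component. The example below is a minimal sketch under assumptions of our own choosing (a quadratic objective, the unit simplex as $K$, and an exact line search that is valid only because the Hessian is the identity); it is not taken from the references cited above.

```python
def frank_wolfe_simplex(target, x0, iters=500, tol=1e-12):
    """Frank-Wolfe for f(x) = 0.5 * ||x - target||^2 over the unit simplex."""
    x = list(x0)
    for _ in range(iters):
        g = [xi - ti for xi, ti in zip(x, target)]      # gradient of f at x
        # linear subproblem (4.9) over the simplex: pick the vertex e_j
        # whose gradient component g_j is smallest
        j = min(range(len(x)), key=lambda i: g[i])
        d = [(1.0 if i == j else 0.0) - xi for i, xi in enumerate(x)]
        gap = -sum(gi * di for gi, di in zip(g, d))     # bound on f(x) - min f
        if gap <= tol:
            break
        # exact line search along d, clipped to the feasible step length 1
        step = min(1.0, gap / sum(di * di for di in d))
        x = [xi + step * di for xi, di in zip(x, d)]
    return x

target = [0.2, 0.5, 0.3]                  # hypothetical data, lies in the simplex
x = frank_wolfe_simplex(target, x0=[1.0, 0.0, 0.0])
f_val = 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))
print(x, f_val)
```

Every iterate is a convex combination of simplex vertices, so feasibility is maintained without any projection step.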


4.3.1.1 Applications to Recourse Problems


For simple recourse when the equivalent (deterministic) nonlinear program is
smooth, algorithms are given, for example, by Ziemba [77]. Kallberg and
Ziemba [33] use the Frank-Wolfe method in a setting where only estimates
of functions and gradients can be obtained. The approach has been widely
studied within the context of the general expectation model, see Ermoliev [16],
and models with probabilistic constraints, see Komaroni [37] and references
cited there. In this latter context, however, one needs to rely on a variant of
the standard Frank-Wolfe method to take into account nondifferentiability
(infinite slope) of the objective at the boundary of the feasible region. Given a
stochastic program with probabilistic constraints of the type

$$\begin{array}{ll}
\text{minimize} & cx \\
\text{subject to} & Ax = b \\
& \mathrm{Prob}\,[\omega \mid Tx \ge h(\omega)] \ge \alpha \\
& x \ge 0
\end{array}$$

we see that it is equivalent to

$$\begin{array}{ll}
\text{minimize} & cx \\
\text{subject to} & Ax = b \\
& Tx - \chi \ge 0 \\
& g(\chi) \ge 0 \\
& x \ge 0
\end{array}$$

where $g(\chi) = \ln(\mathrm{Prob}\,[\omega \mid \chi \ge h(\omega)]) - \ln \alpha$.
Assuming that the probability measure is log-concave, it follows that $g$ is
concave and thus we are dealing with a convex optimization problem with one
nonlinear constraint. Its dual is

$$\begin{array}{ll}
\text{maximize} & ub + \rho(v) \\
\text{subject to} & uA + vT \le c \\
& v \ge 0
\end{array}$$

where $\rho(v) = \inf\,[\, v\chi \mid g(\chi) \ge 0 \,]$.

The function $\rho$ is a sublinear (concave and positively homogeneous) function,
finite-valued (only) on the positive orthant. If the probability measure is
strictly log-concave, the function $\rho$ is differentiable on the interior of the
positive orthant and thus we could use the Frank-Wolfe procedure to solve this
dual problem as long as the iterates $(u^\nu, v^\nu)$ are such that
$v^\nu \in \text{interior } R^{m_2}_+$; when $v^\nu$ is on the boundary of $R^{m_2}_+$, the standard
procedure must be modified to handle the 'infinite' slope case, see Komaroni [37].


4.3.2 Multi-Point Linearization Methods

Consider the problem

$$\text{minimize } f(x) \quad \text{where } g_i(x) \le 0, \; i = 1,\dots,m, \; x \in X \tag{4.10}$$

where all functions are convex, but not necessarily differentiable, and $X$ is a
compact set. We shall concentrate in this section on the generalized linear
programming method (GLP) of Wolfe (see Dantzig [11], Shapiro [65]) which
solves a sequence of problems obtained by inner (or grid) linearization of (4.10)
over the convex hull of a set of points $x^{(1)},\dots,x^{(K)}$, to give the following
master program:

$$\begin{array}{lll}
\text{minimize} & & \sum_{k=1}^{K} \lambda_k f(x^{(k)}) \\
\text{subject to} & u^{(K)}: & \sum_{k=1}^{K} \lambda_k g_i(x^{(k)}) \le 0, \quad i = 1,\dots,m \\
& w^{(K)}: & \sum_{k=1}^{K} \lambda_k = 1, \quad \lambda_k \ge 0
\end{array} \tag{4.11a}$$

where $u^{(K)}$ and $w^{(K)}$ represent the dual variables associated with the optimal
solution of (4.11a). The dual of (4.11a) is

$$\begin{array}{ll}
\text{maximize} & w \\
\text{subject to} & w \le f(x^{(k)}) + u\,g(x^{(k)}), \quad k = 1,\dots,K \\
& u \ge 0.
\end{array} \tag{4.11b}$$

The next grid point $x^{(K+1)}$ is obtained by solving the Lagrangian subproblem

$$\min_{x \in X} \left[ f(x) + u^{(K)} g(x) \right] \tag{4.12}$$

where $u^{(K)}$ is also the optimal solution of (4.11b). Convergence is obtained
when

$$f(x^{(K+1)}) + u^{(K)} g(x^{(K+1)}) \ge w^{(K)}.$$

Since the dual of (4.10) is

$$\max_{u \ge 0} \; h(u), \quad \text{where } h(u) = \min_{x \in X} \left( f(x) + u\,g(x) \right) \tag{4.13}$$

and $h(u)$ is readily shown to be concave, an alternative viewpoint is to regard
the GLP method as a dual cutting plane (or outer linearization) method on
(4.13) yielding (4.11b); new grid points obtained from (4.12) yield a supporting
hyperplane to $h(u)$ at $u^{(K)}$.

It is worth emphasizing again that an important advantage of the
inner-linearization approach is that it can be directly applied to the solution
of nonsmooth convex problems without extensions.

Outer linearization could be applied directly to the functions in (4.10)
to give a primal cutting plane method which also solves sequences of linear
programs. For details, see Kelley [34], Zangwill [76] and Eaves & Zangwill
[13].
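
The primal cutting plane idea just mentioned can be sketched in one dimension, where the master linear program (minimize the maximum of the accumulated cuts over an interval) can be solved by inspecting the interval ends and the pairwise intersections of cuts. This is only an illustrative sketch: the function `kelley_1d`, the test function and the interval are hypothetical, and a real implementation would solve the master with an LP code.

```python
def kelley_1d(f, subgrad, lo, hi, x0, tol=1e-8, max_iter=60):
    """Cutting plane (outer linearization) method for a convex f on [lo, hi]."""
    cuts = []   # each cut is (slope, intercept) of a supporting line of f

    def model(t):
        # piecewise-linear lower approximation of f built from the cuts
        return max(a * t + b for a, b in cuts)

    x = x0
    for _ in range(max_iter):
        gx = subgrad(x)
        cuts.append((gx, f(x) - gx * x))
        # the model attains its minimum at an interval end or at a breakpoint,
        # and every breakpoint is an intersection of two cuts
        cand = [lo, hi]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                (a1, b1), (a2, b2) = cuts[i], cuts[j]
                if abs(a1 - a2) > 1e-12:
                    t = (b2 - b1) / (a1 - a2)
                    if lo <= t <= hi:
                        cand.append(t)
        x = min(cand, key=model)
        if f(x) - model(x) <= tol:      # model value is a lower bound on min f
            break
    return x, f(x)

# hypothetical nonsmooth convex test function with minimizer x = 1
f = lambda x: abs(x - 1.0) + 0.5 * x * x
subgrad = lambda x: (1.0 if x >= 1.0 else -1.0) + x   # one element of the subdifferential
x_best, f_best = kelley_1d(f, subgrad, lo=-3.0, hi=3.0, x0=-3.0)
print(x_best, f_best)
```

Only one subgradient per point is needed, which is what makes the approach applicable to nonsmooth functions.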


4.3.2.1 Applications to Recourse Problems 
For recourse problems, particularly with the form (4.5) using tenders, the GLP 
approach looks very promising. 

Using GLP to solve simple recourse problems has an early history. It 
was first suggested by Williams [74], in the context of computation of error 
bounds and also used at an early date by Beale [2]. Parikh [57] describes many 
algorithmic details. The method has also been implemented for specialized 
applications (e.g. see Ziemba [78], for an application to portfolio selection). 
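However, as a general computational technique, in particular for nonsimple
recourse, it has apparently not been studied until recently; see Nazareth and
Wets [56] and Nazareth [51].

The GLP method applied to (4.5) yields the following master program:

$$\begin{array}{lll}
\text{minimize} & & cx + \sum_{k=1}^{K} \lambda_k \Psi(\chi^{(k)}) \\
\text{subject to} & \rho^{(K)}: & Ax = b \\
& \pi^{(K)}: & Tx - \sum_{k=1}^{K} \lambda_k \chi^{(k)} = 0 \\
& \theta^{(K)}: & \sum_{k=1}^{K} \lambda_k = 1 \\
& & x \ge 0, \; \lambda_k \ge 0.
\end{array} \tag{4.14a}$$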
The associated subproblem is

$$\min_{\chi \in X} \; \Psi(\chi) + \pi^{(K)} \chi \tag{4.14b}$$

In order to complete the description of the algorithm it is necessary to
specify $X$, and if this is not a compact set, to extend the master program (4.14a)
by introducing directions of recession (whose associated variables do not appear
in the convexity row). In addition, a suitable set of starting tenders which span
$R^{m_2}$ should be specified. As discussed in Nazareth [51] these considerations can
be largely circumvented by using the equivalent form (4.6) and solving master
programs of the form:

$$\begin{array}{ll}
\text{minimize} & cx + qy + \sum_{k=1}^{K} \lambda_k \Psi(\chi^{(k)}) \\
\text{subject to} & Ax = b \\
& Tx + Wy - \sum_{k=1}^{K} \lambda_k \chi^{(k)} = 0 \\
& \sum_{k=1}^{K} \lambda_k = 1 \\
& x \ge 0, \; y \ge 0, \; \lambda_k \ge 0.
\end{array} \tag{4.15}$$

As discussed in more detail in Nazareth & Wets [56], we expect the above
algorithm to perform well because normally only a few tenders will have nonzero
coefficients in the optimal solution and because one can expect to obtain a good
set of starting tenders from the underlying recourse program.

Still at issue is how readily one can compute $\Psi(\chi^{(k)})$ and its subgradients
at a given point $\chi^{(k)}$. This in turn determines the ease with which one can
solve the subproblem (4.14b) and obtain coefficients in the objective row of the
master.

For C1 problems $\Psi(\chi)$ is separable and easy to specify explicitly (see Wets
[72]). Algorithms have been given by Parikh [57] and Nazareth [51]. A practical
implementation is given in Nazareth [55] where further details may be
found.

For C2 problems (i.e. with complete recourse and a relatively small set
of scenarios, say $h^\ell$, $\ell = 1,\dots,L$, with known probabilities $f_\ell$, $\ell = 1,\dots,L$) one
can solve the subproblem (4.14b) and compute $\Psi(\chi^{(k)})$ in one of two ways, as
discussed in Nazareth [51]:

(i) Formulate (4.14b) as a linear program which can be efficiently solved by
Schur-Complement techniques, see Bisschop & Meeraus [9], and Gill et al.
[24]. The values $\Psi(\chi^{(k)})$ are a part of the solution of this linear program.

(ii) Use unconstrained nonsmooth optimization techniques, see Lemarechal
[40],[41], Kiwiel [35] and Shor [66]. Information needed by such methods
is $\Psi(\chi^{(k)})$ and its subgradient $g(\chi^{(k)})$ and this can be computed by
solving a set of linear programs of the form:

$$\psi(\chi^{(k)}, h^\ell) = \min_{y \ge 0} \left[\, qy \mid Wy = h^\ell - \chi^{(k)} \,\right].$$

Suppose $\pi^\ell$ are the optimal dual multipliers of the above problem. Then

$$\Psi(\chi^{(k)}) = \sum_{\ell=1}^{L} f_\ell\, \psi(\chi^{(k)}, h^\ell)$$

$$g(\chi^{(k)}) = -\sum_{\ell=1}^{L} f_\ell\, \pi^\ell.$$

This can be carried out very efficiently using the dual simplex method
coupled with techniques discussed by Wets in [73].
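
The scenario-by-scenario computation in (ii) can be sketched for the special case of simple recourse, $W = [I, -I]$, where each scenario linear program and its dual have a closed-form solution and no LP solver is needed. That restriction is an assumption made here to keep the example self-contained; for general complete recourse each call to `scenario_recourse` would be replaced by a (dual) simplex solve. All names and data are hypothetical.

```python
def scenario_recourse(chi, h, q_plus, q_minus):
    """psi(chi, h) = min{q+ y+ + q- y- : y+ - y- = h - chi, y >= 0}
    together with an optimal dual vector pi (requires q+ + q- >= 0)."""
    value, pi = 0.0, []
    for c, hi, qp, qm in zip(chi, h, q_plus, q_minus):
        r = hi - c
        if r >= 0:                # shortage: y+ = r, dual price q+
            value += qp * r
            pi.append(qp)
        else:                     # surplus: y- = -r, dual price -q-
            value += qm * (-r)
            pi.append(-qm)
    return value, pi

def expected_value_and_subgradient(chi, scenarios, probs, q_plus, q_minus):
    """Psi(chi) = sum_l f_l psi(chi, h^l) and g(chi) = -sum_l f_l pi^l."""
    Psi, g = 0.0, [0.0] * len(chi)
    for h, f in zip(scenarios, probs):
        value, pi = scenario_recourse(chi, h, q_plus, q_minus)
        Psi += f * value
        g = [gi - f * p for gi, p in zip(g, pi)]
    return Psi, g

scenarios = [(1.0, 2.0), (3.0, 1.0)]      # hypothetical h^1, h^2
probs = [0.4, 0.6]
q_plus, q_minus = (2.0, 1.0), (1.0, 3.0)
chi = (2.0, 1.5)
Psi, g = expected_value_and_subgradient(chi, scenarios, probs, q_plus, q_minus)
print(Psi, g)
```

The returned vector can be checked against the subgradient inequality $\Psi(\chi') \ge \Psi(\chi) + g^T(\chi' - \chi)$ at a few trial points.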


The method based upon outer linearization mentioned at the end of the
previous section has been widely used to solve stochastic programs with recourse
(see Van Slyke & Wets [68], Wets [73] and Birge [6]). This is a particular form
of Benders' decomposition [3], and it is well known that approaches based upon
Benders' decomposition can solve a wider class of nonlinear convex programs
than approaches based upon the Dantzig-Wolfe decomposition (see, for example,
Lasdon [39]). We shall not, however, discuss this approach in any detail
here because it is already studied in depth in the references just cited.


4.3.2.2 Extensions 


When $\Psi(\chi^{(k)})$ and its subgradients are difficult to compute, the GLP approach
continues to appear very promising but many open questions remain that center
on convergence.

Two broad approaches can be distinguished: 

(i) Sampling: Stochastic estimates of $\Psi(\chi)$ and its subgradient can be obtained
by sampling the distribution. An approach that uses samples of fixed size
and carries out the minimization of the Lagrangian subproblem (4.14b)
using smoothing techniques is described by Nazareth [51]. Methods for
minimizing noisy functions suggested recently by Atkinson et al. [1] would
also be useful in this context. With a fixed level of noise, convergence
proofs can rely upon the results of Poljak [58].

Another variant is to use samples of progressively increasing size tied to
the progress of the algorithm and to solve the Lagrangian subproblem using
stochastic quasi-gradient methods, see Ermoliev & Gaivoronski [18]. A
particular algorithm (suggested jointly with A. Gaivoronski) is to replace
$\Psi(\chi^{(k)})$ in (3.6) by some estimate $\Psi_N(\chi^{(k)})$ which is based upon a suitable
sample size $N$. When no further progress is made, then this sample size
is incremented by $\Delta N$ and the approximation refined for all $\chi^{(k)}$ in the
current basis. There are, of course, many further details that must be
specified, but under appropriate assumptions convergence can be established.
(ii) Approximate Distribution and Compute Bounds: At issue here is how to
simultaneously combine approximation and optimization. For example,
Birge [7] assumes that converging approximations $\Psi^L_K(\chi)$ and $\Psi^U_K(\chi)$ are
available for $K = 1, 2, \dots$ and replaces $\Psi(\chi^{(k)})$ in (4.14a) by the upper
bound $\Psi^U_K(\chi^{(k)})$, $k = 1,\dots,K$. In the subproblem (4.14b), if

$$\Psi^U_K(\chi^{(K+1)}) + \pi^{(K)} \chi^{(K+1)} \ge \theta^{(K)}$$

then $\Psi^L_K(\chi^{(K+1)})$ is computed. If, further, the above inequality is
satisfied using this lower bound in place of the upper bound, then $\chi^{(K+1)}$ is
optimal. Otherwise the approximation is refined and the process continued.
Approximation schemes for obtaining bounds rely on the properties
of recourse problems, instead of purely on the distance between the given
probability distribution and the approximating ones; this allows for
sequential schemes that involve far fewer points, as discussed by Kall &
Stoyan [31] and Birge & Wets [8].
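
The sampling variant (i) above, with a sample size that grows by $\Delta N$, can be sketched as follows. Only the estimate-refinement step is shown; the master program and the test for "no further progress" are omitted, so the increments are simply applied in a fixed loop. The distribution, the recourse function, the tenders and the sample sizes are all assumptions chosen so that the exact value $\Psi(\chi) = \chi^2 - \chi + 1/2$ is known for comparison.

```python
import random

def sample_average(chi, pool, psi):
    """Estimate Psi(chi) = E[psi(chi, h)] by averaging over the sample pool."""
    return sum(psi(chi, h) for h in pool) / len(pool)

def refine(tenders, pool, psi):
    """Re-estimate Psi at every tender currently kept (e.g. those in the basis)."""
    return {chi: sample_average(chi, pool, psi) for chi in tenders}

rng = random.Random(0)                  # fixed seed so the run is reproducible
psi = lambda chi, h: abs(h - chi)       # simple recourse with q+ = q- = 1 (assumed)
tenders = [0.25, 0.5, 0.75]             # hypothetical tenders chi^(k)
N, delta_N, stages = 2000, 2000, 5      # initial sample size and increment (assumed)

pool = [rng.random() for _ in range(N)]             # h ~ Uniform(0, 1) (assumed)
estimates = refine(tenders, pool, psi)
for _ in range(stages - 1):
    # in the full method this step is triggered when the master stops improving
    pool.extend(rng.random() for _ in range(delta_N))
    estimates = refine(tenders, pool, psi)

print(len(pool), estimates)
```

Reusing one growing pool for all tenders keeps the estimates at different tenders consistent with one another, which is one reasonable design choice among several.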

The interpretation of the optimal solution in Nazareth [51] suggests the
possibility of an alternative approach to approximation by an increasingly large
number of points. It is shown that if $\lambda^*_{k_j}$ and $\chi^{(k_j)}$, $j = 1,\dots,(m_2 + 1)$, give
the optimal solution of (4.14a), then the problem (4.5) is equivalent to the
associated discretized problem obtained by replacing the distribution of $h(\omega)$
by the distribution whose values are $\chi^{(k_j)}$, $j = 1,\dots,(m_2 + 1)$, with associated
probabilities $\lambda^*_{k_j}$. Note that $\sum_j \lambda^*_{k_j} = 1$, $\lambda^*_{k_j} \ge 0$, so that these quantities
do indeed define a probability distribution.
Let us conclude this section with a discussion of some other possibilities. 


1. When the technology matrix is nonlinear, i.e. when $T$ is replaced by a
smooth nonlinear function, we have the possibility of a generalized programming
algorithm where the master program itself is nonlinear. The question of
convergence is open. Here an implementation based upon MINOS would be
able to immediately draw upon the ability of this routine to solve programs
with nonlinear constraints.


2. When some columns of $T$ are stochastic, the transformation discussed at the
end of Section 4.2 can also be used within the context of the GLP algorithm to
keep the degree of nonlinearity low. This time inner approximation of $\Psi(\chi, \hat{x})$
would be carried out over the convex hull of $(\chi^{(k)}, \hat{x}^{(k)})$, $k = 1,\dots,K$.


3. Generalized programming techniques appear to be useful for solving
programs with probabilistic constraints, for example, of the form:

$$\begin{array}{ll}
\text{minimize} & cx \\
\text{subject to} & Ax = b \\
& \mathrm{Prob}\,[\omega \mid Tx \ge h(\omega)] \ge \alpha \\
& x \ge 0.
\end{array}$$
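With the usual definition of tenders $Tx = \chi$ and under the appropriate
assumptions on the distribution of $h(\cdot)$, we can express the above problem as:

$$\begin{array}{ll}
\text{minimize} & cx \\
\text{subject to} & Ax = b \\
& Tx - \chi = 0 \\
& g(\chi) \le 0 \\
& x \ge 0
\end{array}$$

where $g(\chi) = \alpha - \mathrm{Prob}\,[\omega \mid h(\omega) \le \chi]$ is a nonlinear function whose
probability term is log-concave for a wide variety of distribution functions, in
which case the set $[\chi \mid g(\chi) \le 0]$ is convex. In such a situation we can
reformulate the constraint $g(\chi) \le 0$ as

$$\chi \in D = \{\chi \mid g(\chi) \le 0\}$$

where

$$g(\chi) = \alpha - \int_{s \le \chi} p(s)\, ds.$$

Here $p(\cdot)$ denotes the density function of the random vector $h(\cdot)$. Assuming
that we have already generated $\chi^1,\dots,\chi^K$ in $D$ such that

$$\{\chi = Tx \mid Ax = b, \; x \ge 0\} \cap \mathrm{co}\{\chi^1,\dots,\chi^K\} \ne \emptyset,$$

we would be confronted at step $K$ with the master problem:

$$\begin{array}{lll}
\text{minimize} & & cx \\
\text{subject to} & \sigma^K: & Ax = b \\
& \pi^K: & Tx - \sum_{\ell=1}^{K} \lambda_\ell \chi^\ell = 0 \\
& \theta^K: & \sum_{\ell=1}^{K} \lambda_\ell = 1 \\
& & x \ge 0, \; \lambda_\ell \ge 0, \quad \ell = 1,\dots,K
\end{array}$$

where $(\sigma^K, \pi^K, \theta^K)$ represent the dual variables associated with the optimal
solution $(x^K, \lambda^K)$ of this master problem. The next tender $\chi^{K+1}$ is obtained
by solving the Lagrangian subproblem, involving only $\chi$:

$$\text{minimize } [\, \pi^K \chi \mid \chi \in D \,]$$

and this $\chi^{K+1}$ is introduced in the master problem unless

$$\pi^K \chi^{K+1} \ge \theta^K$$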
in which case $x^K$ is an optimal solution of the master problem. To find

$$\chi^{K+1} \in \mathrm{argmin} \left\{ \pi^K \chi \;\Big|\; \int_{s \le \chi} p(s)\, ds \ge \alpha \right\}$$

we consider the Lagrangian function

$$\ell(\chi, \beta) = \pi^K \chi + \beta \left( \int_{s \le \chi} p(s)\, ds - \alpha \right), \quad \beta \ge 0$$

and the dual problem

$$\text{maximize } h(\beta), \quad \beta \ge 0,$$

where

$$h(\beta) = \inf\,[\, \ell(\chi, \beta) \mid \chi \,].$$

The function $h$ is a 1-dimensional concave function, its (generalized) derivative
is a monotone function, and, moreover, under strict log-concavity
of the probability measure, its maximum is attained at a unique point. To
search for the optimal $\beta^K$ we can use a secant method for finding the zero of a
monotone function. We have that for fixed $\beta$,

$$\chi(\beta) = \mathrm{argmin}\; \ell(\chi, \beta)$$

is obtained by solving the following system of equations:

$$-\pi_i^K / \beta = \int_{\{s_\ell \le \chi_\ell,\; \ell \ne i\}} p(s_1,\dots,s_{i-1}, \chi_i, s_{i+1},\dots,s_{m_2})\, ds, \quad i = 1,\dots,m_2.$$

If $p$ is simple enough, or if it does not depend on too many variables, then this
system can be solved by a quasi-Newton procedure that avoids multidimensional
integration.

This application to chance-constrained stochastic linear programming is an 
open area and certainly deserves further investigation. 


4. It is also worth pointing out that generalized programming methods have 
been recently applied to the study of problems with partially known distribution 
functions (incomplete information), see Ermoliev et al. [17] and Gaivoronski 
[23]. 
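
The multiplier search described in item 3 above can be illustrated on a one-dimensional toy problem: minimize $\pi\chi$ subject to $\mathrm{Prob}[h \le \chi] \ge \alpha$ with $h$ exponentially distributed. The sketch below uses its own sign convention, $\ell(\chi, \beta) = \pi\chi + \beta(\alpha - F(\chi))$ with $\beta \ge 0$ and $\pi > 0$, so that the derivative of the dual function, $\alpha - F(\chi(\beta))$, is nonincreasing in $\beta$. The exponential distribution, the numerical values, and the bracketing safeguard (the 'Illinois' variant, used here in place of a plain secant iteration) are all assumptions of this example.

```python
import math

def bracketed_secant(phi, lo, hi, tol=1e-12, max_iter=100):
    """Zero of a monotone function by secant steps kept inside a bracket."""
    f_lo, f_hi = phi(lo), phi(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("the zero must be bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = hi - f_hi * (hi - lo) / (f_hi - f_lo)     # secant step
        f_mid = phi(mid)
        if abs(f_mid) <= tol:
            return mid
        if f_mid * f_hi < 0:
            lo, f_lo = hi, f_hi       # the zero lies between hi and mid
        else:
            f_lo *= 0.5               # Illinois safeguard: avoid a stuck end point
        hi, f_hi = mid, f_mid
    return hi

alpha, pi_K = 0.9, 2.0                # assumed probability level and price

def chi_of_beta(beta):
    # minimizer over chi >= 0 of pi*chi + beta*(alpha - F(chi)), F(chi) = 1 - exp(-chi)
    return math.log(beta / pi_K) if beta > pi_K else 0.0

def dual_slope(beta):
    # derivative of the dual function: constraint residual at chi(beta)
    return alpha - (1.0 - math.exp(-chi_of_beta(beta)))

beta_star = bracketed_secant(dual_slope, lo=pi_K, hi=100.0 * pi_K)
chi_star = chi_of_beta(beta_star)
print(beta_star, chi_star)
```

In this toy case the answer is available in closed form, $\chi^* = -\ln(1 - \alpha)$, which makes the sketch easy to check; in several dimensions `chi_of_beta` would be replaced by the solution of the system of equations given in the text.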
4.4 Variable Reduction (Partitioning) Methods


Methods in this group seek to restrict the search region to one defined by a 
subset of the variables and carry out one or more iterations of a gradient (or 
subgradient) based search procedure. The search region is then revised and 
the process continued. We can make a distinction between ‘homogeneous’ and 
‘global’ methods (using the terminology of Lemarechal [42]). Homogeneous or 
active set methods, in the linearly constrained case, restrict the region of search 
to an affine subspace within which unconstrained minimization techniques can 
be used. We shall concentrate on the reduced gradient formulation of Murtagh 
& Saunders [49],[50] as implemented in MINOS and seek extensions of an 
approach which has proved effective for large smooth problems. However the 
fact that extension is necessary, in contrast to the methods of the previous 
section, and the fact that there are theoretical issues of convergence that remain 
to be settled mean that such methods are still very much in the development 
stage. 

Global methods treat all constraints simultaneously and define direction
finding subproblems which usually involve minimization subject to inequality
constraints (often just simple bound constraints). Convergence issues are more
easily settled here. We shall consider some methods of this type.

We also include here approaches where the partition of variables is more 
directly determined by the problem structure, in particular the grouping into 
linear and nonlinear variables. 

Consider first the problem defined by

$$\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & Ax = b \\
& x \ge 0
\end{array} \tag{4.16}$$

where, initially, $f(x)$ is assumed to be smooth.

The variables at each cycle of the Murtagh and Saunders [49] reduced
gradient method are partitioned into three groups, $(x_B, x_S, x_N)$, representing $m$
basic variables, $s$ superbasic variables, and $nb = n - m - s$ nonbasic variables
respectively. Nonbasics are at their bound. $A$ is partitioned as $[B \mid S \mid N]$ where
$B$ is an $m \times m$ nonsingular matrix, $S$ is an $m \times s$ matrix, and $N$ is an $m \times nb$
matrix. Let $g = \nabla f(x)$ be similarly partitioned as $(g_B, g_S, g_N)$.

Each cycle of the method can be viewed as being roughly equivalent to: 


RG1 one or more iterations of a quasi-Newton method on an unconstrained
optimization problem of dimension $s$ determined by the active set $Ax = b$,
$x_N = 0$. Here a reduced gradient is computed as

$$\mu = g_S - S^T B^{-T} g_B = [\, -(B^{-1}S)^T \mid I_{s \times s} \mid 0 \,]\, g \equiv Z_S^T g. \tag{4.17}$$

The columns of $Z_S$ span the space in which the quasi-Newton search direction
lies, and this is given by $p = -Z_S H Z_S^T g$ where $H$ is an inverse Hessian
approximation obtained by quasi-Newton update methods and defines the variable
metric, e.g. $H = I$ gives the usual projected gradient direction. Along $p$ a line
search is usually performed. (Note that in actual computation $H$ would not
be computed. Instead we would work with approximations to the Hessian and
solve systems of linear equations to compute the search direction $p$.)

RG2 an iteration of the revised simplex method on a linear program of
dimension $m \times nb$. Here components of the reduced gradient (Lagrange multipliers)
corresponding to the nonbasic components are computed by

$$\lambda = g_N - N^T B^{-T} g_B \tag{4.18}$$

$$\lambda = [\, -(B^{-1}N)^T \mid 0 \mid I_{nb \times nb} \,]\, g \equiv Z_N^T g. \tag{4.19}$$

This is completely analogous to the computation of $\mu$ in (4.17) above. The
difference is in the way that $\lambda$ is used, namely to revise the active set. In each
case above prices $\pi$ can be computed by $\pi^T = g_B^T B^{-1}$ and $\mu$ and $\lambda$ computed as

$$\mu = g_S - S^T \pi, \qquad \lambda = g_N - N^T \pi \tag{4.20}$$

(It is worth noting that the convex simplex method is a special case of the above
where (RG1) is omitted and (RG2) is replaced by a coordinate line search along
a single coordinate direction in the reduced space given by $(Z_N)_k$, say, for which
$\lambda_k < 0$. When there are nonlinear constraints present the above method can
also be suitably generalized.)
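
The pricing computations in RG1 and RG2 can be sketched on a tiny example. The prices are obtained from the linear system $B^T \pi = g_B$ and then used to price out the superbasic and nonbasic columns, as in (4.20). The dense Gaussian elimination below merely stands in for the sparse factorization a code like MINOS would maintain, and the matrices and gradient values are hypothetical.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [list(row) + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            m = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= m * A[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (A[r][n] - sum(A[r][k] * x[k] for k in range(r + 1, n))) / A[r][r]
    return x

def reduced_gradients(B, S, N, g_B, g_S, g_N):
    """Return prices pi, reduced gradient mu (superbasics) and lambda (nonbasics)."""
    B_T = [list(col) for col in zip(*B)]
    pi = solve(B_T, g_B)                              # prices: B^T pi = g_B

    def price_out(cols, g):
        # g_j minus (column j of the block)^T pi, for each column j
        return [gj - sum(cols[i][j] * pi[i] for i in range(len(pi)))
                for j, gj in enumerate(g)]

    return pi, price_out(S, g_S), price_out(N, g_N)

# hypothetical data: m = 2 rows, one superbasic and one nonbasic column
B = [[1.0, 1.0], [1.0, 2.0]]
S = [[1.0], [0.0]]
N = [[1.0], [1.0]]
pi, mu, lam = reduced_gradients(B, S, N, g_B=[1.0, 3.0], g_S=[2.0], g_N=[5.0])
print(pi, mu, lam)
```

The same price vector serves both computations; only the use of the result differs, with $\mu$ driving the search in the superbasic space and $\lambda$ the revision of the active set.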

In the nonsmooth case we can proceed along three main directions: 


1. Compute $\mu$ and $\lambda$ in place of the above by

$$\begin{array}{l}
\mu = Z_S^T \left\{ \mathrm{argmin}\,[\, g^T (Z_S Z_S^T)\, g \mid g \in \partial f(x) \,] \right\} \\
\lambda = Z_N^T \left\{ \mathrm{argmin}\,[\, g^T (Z_N Z_N^T)\, g \mid g \in \partial f(x) \,] \right\}
\end{array} \tag{4.21}$$

where $\partial f(x)$ is the subdifferential of $f(x)$ at $x$. In effect we are computing
steepest descent directions in the appropriate subspaces. Note that it is, in
general, not correct to first compute a steepest descent direction $\bar{g}$ from

$$\bar{g} = \mathrm{argmin}\,[\, g^T g \mid g \in \partial f(x) \,]$$

and then reduce $\bar{g}$ to give

$$\begin{array}{l}
\mu = Z_S^T \bar{g} \\
\lambda = Z_N^T \bar{g}.
\end{array} \tag{4.22}$$

The reason for this is that the operations of minimization and projection are
not interchangeable. However this approach does make it possible to restore
use of the $\pi$ vector and therefore yields useful heuristic methods, as we shall
see in the next section. In order to ensure convergence, it is necessary to
replace $\partial f(x)$ by $\partial_\epsilon f(x)$, the $\epsilon$-subdifferential (except in special circumstances,
e.g. when $f(x)$ is polyhedral and line searches are exact). This is useful from
a theoretical standpoint. However, from the point of view of computation it
is usually impractical to use the subdifferential, let alone the $\epsilon$-subdifferential
(except again in rather special circumstances). One such instance is when the
subdifferential is defined by a small set of vectors, say, $g_1,\dots,g_N$. Then (4.21)
leads to the problem:

$$\begin{array}{ll}
\text{minimize} & g^T Z_S H Z_S^T g \\
\text{subject to} & g = \sum_{i=1}^{N} \lambda_i g_i \\
& \sum_{i=1}^{N} \lambda_i = 1 \\
& \lambda_i \ge 0
\end{array} \tag{4.23}$$

If $g^*$ is its solution, then $\mu = Z_S^T g^*$, with a similar computation for $Z_N$. We
also have $p = -Z_S H Z_S^T g^*$.


2. Utilize bundle methods in which the subdifferential is replaced by an
approximation composed from subgradients obtained at a number of prior iterations.
For the unconstrained case algorithms are given by Lemarechal [40],[41] and
an implementable version is given by Kiwiel [35]. An extension of [40] to
handle linear constraints in the reduced gradient setting is given by Lemarechal et
al. [45]. However, as the authors point out, theoretical issues of convergence
remain to be settled in the latter case.


3. Utilize nonmonotonic methods (see, for example, Shor [66]) which require
only a single subgradient at each iteration. In effect nonmonotonic iterations
will be carried out in subspaces (see RG1 and RG2 above) determined by $Z_S$
and $Z_N$, using reduced subgradients $Z_S^T g$ and $Z_N^T g$. Again convergence issues
remain open.


Line searches suitable for use in the above cases (1) and (2) are given by 
Mifflin [48] and Lemarechal [43]. 
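
The remark in direction 1 that minimization and projection are not interchangeable can be checked numerically when the subdifferential is spanned by just two subgradients, so that problem (4.23) with $H = I$ reduces to finding the shortest vector on a line segment. In the sketch below the reduction matrix `Z_T` (playing the role of $Z_S^T$) and the subgradients are hypothetical.

```python
def min_norm_on_segment(r1, r2):
    """Shortest vector of the form lam*r1 + (1-lam)*r2 with 0 <= lam <= 1."""
    diff = [a - b for a, b in zip(r1, r2)]
    dd = sum(d * d for d in diff)
    if dd == 0.0:
        return 1.0, list(r1)
    lam = -sum(b * d for b, d in zip(r2, diff)) / dd    # unconstrained minimizer
    lam = min(1.0, max(0.0, lam))                       # clip to the segment
    return lam, [lam * a + (1.0 - lam) * b for a, b in zip(r1, r2)]

def reduce(Z_T, g):
    """Apply the reduction matrix Z^T (given by rows) to a subgradient g."""
    return [sum(z * gi for z, gi in zip(row, g)) for row in Z_T]

Z_T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # hypothetical Z_S^T
g1, g2 = [1.0, 2.0, 0.0], [1.0, -1.0, 3.0]  # hypothetical subgradients

# reduce first, then minimize: the computation called for by (4.21)/(4.23)
_, mu = min_norm_on_segment(reduce(Z_T, g1), reduce(Z_T, g2))

# minimize in the full space first, then reduce: the heuristic (4.22)
_, g_bar = min_norm_on_segment(g1, g2)
mu_heuristic = reduce(Z_T, g_bar)

print(mu, mu_heuristic)
```

The two results differ, which is the point of the remark; with more than two subgradients the segment search would be replaced by a small quadratic program over the simplex of multipliers.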

The reduced gradient method as formulated above benefits from additional 
structure in objective and constraints, in particular the partition between vari- 
ables that occur linearly and variables that occur nonlinearly. We shall see 
instances of this in the discussion of recourse problems. In particular, it is easy
to show that when $f(x)$ is replaced by $cx + \Psi(\chi)$, an optimal solution exists
for which the number of superbasics does not exceed the number of nonlinear
variables $\chi$.

Instead of obtaining an active set from $x_N = 0$, another approach which
gives a 'global' method is to reduce the gradient or subgradient only through
the equality constraints $Ax = b$ (these are always active) and define reduced
problems to find the search direction involving bound constraints on the $x_N$
variables. This is discussed in Bihain [5]. (See also Strodiot et al. [67].)

Reduced gradient methods, as discussed above, benefit from the partition 
of the problem into linear and nonlinear variables, but they do not explicitly 
utilize it. It is however possible to take more immediate advantage of this 
partition. Possible approaches are given, for example, by Rosen [62] and by 
Ermoliev [15]. Consider the problem 


minimize    $cx + F(y)$ 
subject to  $Ax + By = b$, 
            $x \ge 0, \; y \ge 0$. 


If the nonlinear variables $y$ are fixed at certain values we obtain a simpler 
problem, in this case a linear program (which may have further structure, for 
example, when $A$ is block-diagonal). The optimal dual multipliers $\pi^*$ of this 
linear program (assumed feasible) can then be used to define a reduced 
subproblem, for example, $F(y) - (\pi^*)^T B y$, $y \ge 0$. This is then solved to revise 
the current values of $y$, for example, by computing a reduced subgradient 
$g - (\pi^*)^T B$, $g \in \partial F(y)$, and carrying out (nonmonotonic) iterations in the 
positive orthant of the $y$ variables (see Ermoliev [15]). An alternative approach 
is given by Rosen [63]. 


4.4.1 Applications to Recourse Problems 


Since the number of nonlinear variables $\chi$ in (4.5) is usually small relative to 
the number of linear variables, the reduced gradient approach outlined above 
is a natural choice. When $\Psi(\chi)$ is smooth (and the gradient is computable) 
the reduced gradient method can be used directly. In the form of the convex 
simplex method, which is a special case of the reduced gradient method, it has 
been suggested for the simple recourse problem by Wets [69] and Ziemba [77]. 
Wets [71] extends the convex simplex method to solve problems with simple 
recourse when the objective is nonsmooth. 

For C1 problems $\partial \Psi_i(\chi_i) = [v_i^-, v_i^+]$ (see Nazareth & Wets [56]). The 
computation of $\mu$ and $\lambda$ in (4.21) thus requires that we solve bound constrained 
quadratic programs. We can utilize structure in the basis matrix in defining 
these quadratic programs. Since the $\chi$ variables are unrestricted, they can be 
assumed to be always in the basis. A basis matrix will thus have the form 


gue 
= (F 7 (4.24a) 


and its inverse (never, of course, computed directly) will therefore be given by 


D-! 0 
Bis= BBE ok ‘ (4.24) 


Let $g_B = (c_B, g_\chi)$ where $c_B$ are coefficients of the objective row corresponding 
to the $x$ variables in the basis and $g_\chi$ is a subgradient of $\Psi(\chi)$ at the current 
value of $\chi$. Also, since superbasics and nonbasics are always drawn from $c$, we 
shall use $c_S$ and $c_N$ in place of $g_S$ and $g_N$. Thus we define $g = (c_B, g_\chi, c_S, c_N)$. 
The quadratic program (4.21) then takes the form 


minimize    $g^T Z_S Z_S^T g$ 
subject to  $v_i^- \le (g_\chi)_i \le v_i^+, \quad i = 1, \ldots, m_2$,        (4.25) 

where $g$ is defined above and $Z_S^T = (-(B^{-1}S)^T \mid I_{s \times s} \mid 0)$ with $B$ defined by (4.24a). 
Note that usually $g_\chi$ will have relatively few components. A similar bound 
constrained quadratic program can be defined for $Z_N$. Both can be solved 
very efficiently using a routine like QPSOL, see [25]. The above approach also 
requires a line search, and an efficient one, based upon a specialized version of 
generalized upper bounding, is given in Nazareth [54]. An implementation 
could thus be based upon MINOS, QPSOL and this line search. 
It is possible to avoid the use of quadratic programming by using a heuristic 
technique in which a steepest-descent direction is first computed as the solution 
of the expression preceding (4.22). This is given by: 


minimize    $g_\chi^T g_\chi$ 
subject to  $v_i^- \le (g_\chi)_i \le v_i^+, \quad i = 1, \ldots, m_2$.        (4.26) 

The solution $\bar{g}_\chi$ is given explicitly by: 

$(\bar{g}_\chi)_i = \begin{cases} v_i^- & \text{if } v_i^- > 0, \\ 0 & \text{if } 0 \in [v_i^-, v_i^+], \\ v_i^+ & \text{if } v_i^+ < 0. \end{cases}$        (4.27) 

Projected quantities $Z_S^T \bar{g}$ and $Z_N^T \bar{g}$ can then be computed with $\bar{g}$ defined 
analogously to $g$ (just before expression (4.25)). This and use of the line search in 
Nazareth [54] suggests a very convenient heuristic extension of MINOS. Even 
the construction of a specialized line search can be avoided by utilizing line 
search methods designed for smooth problems (again heuristic in this context) 
as discussed by Lemarechal [44]. 
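
The explicit solution (4.27) is simply the projection of the origin onto the box of admissible subgradient components. The short Python sketch below illustrates this componentwise rule; the function name `min_norm_in_box` and the argument names `v_minus` and `v_plus` are illustrative choices and do not come from the text or from MINOS.

```python
def min_norm_in_box(v_minus, v_plus):
    """Return the minimum-norm vector g with v_minus[i] <= g[i] <= v_plus[i].

    Hypothetical helper illustrating rule (4.27); not part of any library.
    """
    g = []
    for lo, hi in zip(v_minus, v_plus):
        if lo > hi:
            raise ValueError("empty interval: lower bound exceeds upper bound")
        if lo > 0:
            g.append(lo)      # interval lies strictly to the right of zero
        elif hi < 0:
            g.append(hi)      # interval lies strictly to the left of zero
        else:
            g.append(0.0)     # zero itself is admissible
    return g


g_bar = min_norm_in_box([1.0, -2.0, -3.0], [4.0, 5.0, -0.5])
print(g_bar)  # [1.0, 0.0, -0.5]
```

Because the objective of (4.26) is separable, each component can be treated independently, which is why no quadratic programming routine is needed for this step.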

For C2 problems, computing $\mu$ and $\lambda$ by (4.21) again requires that we solve 
the following special structured quadratic program (Nazareth & Wets [56]): 


find $g \in R^n$ such that $\|g\|_M^2$ is minimized 
such that 

$g_\chi = -\sum_{\ell=1}^{L} f_\ell \pi^\ell$, and $\pi^\ell W \le q$, $\pi^\ell (h^\ell - \chi) \ge \psi(\chi, h^\ell)$, $\ell = 1, \ldots, L$, 


where $h^\ell$ and $f_\ell$ define the probability distribution of the scenarios, as in Section 
4.3.2.1. $M$ defines the metric and for different choices, the objective takes the 
form $g^T g$ (or equivalently, in this case, $g_\chi^T g_\chi$), $g^T Z_S Z_S^T g$ or $g^T Z_N Z_N^T g$. Again 
special purpose techniques can be devised to solve such problems. It is however 
often impractical to consider use of the above steepest descent approach 
because only $\Psi(\chi)$ and a subgradient are available. In this case an algorithm 
would have to be designed around bundle techniques or nonmonotonic 
optimization as discussed in Section 4.4, items (2) and (3) (after expression (4.23)), 
using reduced subgradients given by $Z_S^T g$ and $Z_N^T g$, with $g$ and other quantities 
defined as in the paragraph preceding (4.25). In this case an implementation 
could be based upon a routine for minimizing nonsmooth functions, see Bihain 
[5]. 

In the above methods the $\chi$ variables would normally always be in the 
basis, since they have no bounds on their value. This means that there are 
always some variables in the basis which correspond to the nonsmooth part of 
the objective function. An alternative approach is to try to restore a simpler 
pricing strategy by keeping the $\chi$ variables always superbasic and defining 
a basis only in the $x$ variables. The alternating method of Qi [61] is an attempt 
in this direction although it is not implementable in the form given in [61]. 
Other methods along these lines are given by Birge [6]. However, the numerical 
results given by Birge [6] show that the approach may not be as promising 
as the method based upon outer linearization (the so-called L-shaped method) 
mentioned at the end of Section 4.3.2.1. 


4.4.2 Extensions 


As with generalized linear programming, we think that much can be done by 
extending the above approach, when $\Psi(\chi)$ and its subgradient are hard to 
compute, but there are many open questions. As in Section 4.3.2.2, two broad 
approaches can be followed: 


(i) Sampling: Potentially the most valuable approach seems to be an 
alternating method in which one would carry out iterations in the x space and 
combine them in some suitable way with subgradient (or stochastic 
quasi-gradient) iterations in the z space (along the lines suggested by Ermoliev 
[15]). It is also possible to consider ‘homogeneous’ or active set methods 
which extend the reduced gradient approach and interleave iterations 
involving two projection operators into the space defined by superbasic and 
nonbasic variables respectively. 


(ii) Approximate Distribution and Compute Bounds: For a discussion of this 
approach see Birge & Wets [8]. 


4.5 Lagrange Multiplier Methods 


We conclude this chapter with a very brief mention of methods which have 
recently achieved much popularity for smooth and nonsmooth optimization 
and are thus likely to lead to useful methods for solving recourse problems. 
Bertsekas [4] and Powell [59] give comprehensive reviews in the smooth case. 
Lemarechal [42] explains connections with minimax optimization and other 
methods of nonsmooth optimization. 

A distinguishing feature of methods in this category is that they combine 
cutting plane techniques with use of a quadratic penalty term in the compu- 
tation of search directions and that they often treat the constraints ‘globally’, 
again in the sense of Lemarechal [42]. For an example of the use of a (parame- 
terized) quadratic penalty term in unconstrained minimization see the proximal 
point method of Rockafellar [63]; in smooth nonlinear programming, see Wilson 
[75] and in nonsmooth optimization, see Pschenichnyi & Danilin [60]. 

Consider the problem 


minimize    $f(x)$ 
subject to  $Ax = b$,        (4.28) 
            $x \ge 0$. 


The search direction finding problem then takes the form: 


minimize    $v + \tfrac{1}{2} d^T B d$ 
subject to  $v \ge -\alpha_i + g_i^T d, \quad i \in I$, 
            $Ax = b$,        (4.29) 
            $x \ge 0$, 


where $I$ denotes an index set, $g_i$, $i \in I$, is a set of subgradients of $f(x)$ and 
each $\alpha_i$ is a scalar. If $B = 0$, $I$ has only one element and $f(x)$ is smooth (so 
that $g_i$ corresponds to a gradient), note the connection with the method of Frank & 
Wolfe [21] (see also Section 4.3.1). When $B$ is the identity matrix, we have 
the method suggested by Pschenichnyi & Danilin, see [60]. 

By dualizing (4.29) it is easy to establish ties with steepest descent methods 
determined by bundles of subgradients in the appropriate reduced space 
together with the appropriate definition of a metric (see (4.23) and also Han 
[27],[28], Lemarechal [42], Kiwiel [35] and Demyanov & Vasiliev [12]). 
Recently Kiwiel [36] has suggested a method which further exploits the structure 
in (4.23) and has also considered extensions of methods under consideration in 
this section when there is uncertainty in the value of the function. 

Finally, for application of ideas underlying Lagrange multiplier methods 
to stochastic programs with recourse, see Rockafellar & Wets [64] and Merkovsky, 
Dempster & Gunn [47]. 


CHAPTER 5 


NUMERICAL SOLUTION OF PROBABILISTIC 
CONSTRAINED PROGRAMMING PROBLEMS 


A. Prékopa 


5.1 Introduction 
In this paper we present solution techniques for problems of the following kind 


minimize    $h(x)$ 
subject to  $h_0(x) = P(g_1(x, \xi) \ge 0, \ldots, g_r(x, \xi) \ge 0) \ge p$,        (5.1) 
            $h_1(x) \ge p_1, \ldots, h_m(x) \ge p_m$, 


where for the sake of simplicity we assume that the functions $h, h_1, \ldots, h_m$ are 
defined on the whole $n$-dimensional space. Similarly, the functions 
$g_1(x, y), \ldots, g_r(x, y)$ are supposed to be defined on the whole $(n+q)$-dimensional 
space, $x \in R^n$, $y \in R^q$. For the probability $p$ the notation $p_0$ will also be used. 

Various engineering and economic problems can be cast into this form. 
We do not intend to survey here the applied models belonging to this 
category. We only refer to a few papers [7]-[11], where the interested reader 
may find model formulations and references to applications. 

The most important special case of Problem (5.1) is obtained by specializing 
the functions $g_i(x, y)$, $i = 1, \ldots, r$ so that 


$g_i(x, y) = T_i x - y_i, \quad i = 1, \ldots, r$, 


where $T_1, \ldots, T_r$ are rows of an $r \times n$ matrix $T$. In this case the probabilistic 
constraint in Problem (5.1) takes the form 


$P(Tx \ge \xi) \ge p$.        (5.2) 


Introducing the notation $F(z)$ for the joint probability distribution function of 
the components of the random vector $\xi$, i.e., $F(z) = P(\xi \le z)$, the constraint 
(5.2) can be written in the following manner 


$F(Tx) \ge p$.        (5.3) 


Before proceeding to describe the numerical solution techniques for Problem 
(5.1) we mention the following theorem that serves as a basis of the convergence 
theory in many special cases. For the proof of the theorem we refer to the 
survey paper [12] and the references there. 

Theorem 5.1. If $g_1(x, y), \ldots, g_r(x, y)$ are concave functions in $R^{n+q}$ and $\xi$ has 
a continuous probability distribution with logarithmically concave probability 
density function $f$, i.e., for every $z_1, z_2 \in R^q$ and $0 < \lambda < 1$ we have 


$f(\lambda z_1 + (1 - \lambda) z_2) \ge [f(z_1)]^{\lambda} [f(z_2)]^{1 - \lambda}$, 

then the function $h_0$ is also logarithmically concave in $R^n$. 


This theorem implies that if $\xi$ has the required property then the function 
standing on the left hand side of (5.2) is logarithmically concave. 


Maximization of Probability and the Method of Two Phases 
Together with problem (5.1) we also formulate the problem 


maximize    $h_0(x) = P(g_1(x, \xi) \ge 0, \ldots, g_r(x, \xi) \ge 0)$        (5.4) 
subject to  $h_1(x) \ge p_1, \ldots, h_m(x) \ge p_m$. 

This problem has practical importance too. Many reliability problems belong 
to this category. For one practical application we refer to the paper [18] where a 
sequential decision process consists of a sequence of problems of the type (5.4). 

Problem (5.4) is also important because, when solving problem (5.1), 
a two-phase method can be applied where in the first phase we seek a feasible 
solution and in the second phase we solve the original problem. Assuming that 
we possess a method to find a feasible solution to the system of inequalities 
$h_1(x) \ge p_1, \ldots, h_m(x) \ge p_m$, a feasible solution to problem (5.1) can be found 
in such a way that we start to solve problem (5.4) and stop the procedure when 
we reach an $x$ satisfying $h_0(x) \ge p$. This $x$ is a feasible solution to problem 
(5.1). 

For the solution of problem (5.1) we propose the application of suitable 
nonlinear programming methods supplemented by Monte Carlo simulation 
procedures to find function values and gradients of the function $h_0$. There exist 
other proposals for solving stochastic programming problems, among which the 
stochastic quasi-gradient method of Yu. Ermoliev and his collaborators should 
be mentioned. There is, however, little experience regarding how this method 
works in the case of problems (5.1) and (5.4). On the other hand, it seems 
advantageous to apply the already well developed theory and techniques of 
nonlinear programming. In this case, among other things, we are able 
to present an optimality criterion which helps us to check the termination of the 
applied optimization procedure. 

A nonlinear programming method which has proved to be effective for 
deterministic nonlinear programming problems is not necessarily effective for 
the solution of problems (5.1) and (5.4). The reason is that in problems 
(5.1) and (5.4) each value of the function $h_0$ is the probability of a set in $R^q$, and 
these values, as well as the values of the gradient of $h_0$, are calculated by Monte 
Carlo simulation. The latter gives a satisfactorily accurate value provided 
the sample size is chosen large enough. However, we are able to do so only 
if the effect of the Monte Carlo simulation can be well controlled, i.e., the 
effect of this kind of randomness can clearly be seen throughout the procedure 
and the numerically unstable steps can be avoided or at least controlled. 
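
To make the role of the sample size concrete, the following Python sketch estimates $h_0(x) = P(Tx \ge \xi)$ by crude Monte Carlo and reports the binomial standard error of the estimate. It is only an illustration under stated assumptions: the function names, the choice of two independent standard normal components for $\xi$, and the fixed seed are all hypothetical and are not taken from the chapter, which relies on more refined simulation procedures.

```python
import random


def estimate_probability(T, x, sample_xi, n_samples, rng):
    """Crude Monte Carlo estimate of P(T x >= xi), taken componentwise.

    Returns the estimate and its binomial standard error.
    """
    Tx = [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) for row in T]
    hits = 0
    for _ in range(n_samples):
        xi = sample_xi(rng)
        if all(tx_i >= xi_i for tx_i, xi_i in zip(Tx, xi)):
            hits += 1
    p_hat = hits / n_samples
    std_err = (p_hat * (1.0 - p_hat) / n_samples) ** 0.5
    return p_hat, std_err


def sample_independent_normals(rng):
    # Assumed distribution for illustration: two independent N(0, 1) components,
    # whose joint density is logarithmically concave.
    return [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]


rng = random.Random(12345)          # fixed seed so the run is reproducible
T = [[1.0, 0.0], [0.0, 1.0]]
p_hat, std_err = estimate_probability(T, [0.0, 0.0],
                                      sample_independent_normals, 20000, rng)
print(p_hat, std_err)               # the exact probability here is 0.25
```

The standard error shrinks only like the inverse square root of the sample size, which is why an optimization method that amplifies small errors in $h_0$ can be unsuitable even if it performs well on deterministic problems.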


5.2 The SUMT Method with Logarithmic Penalty Function 
We introduce the following assumptions: 

• $0 < p < 1$, $p_1 > 0, \ldots, p_m > 0$, $h$ is convex in $R^n$, 

• $h_1, \ldots, h_m$ are continuous logconcave functions in $R^n$, 

• $g_1, \ldots, g_r$ are concave functions in $R^{n+q}$, 

• the set of feasible solutions is compact, 

• there exists an $x$ satisfying $h_i(x) > p_i$, $i = 0, \ldots, m$, 

• $\xi$ has a continuous probability distribution with logarithmically concave 
density. 


The Sequential Unconstrained Minimization Technique [2] applied to our 
problem works in the following manner [10]. We define the penalty function 


$T(x, s) = h(x) - s \sum_{i=0}^{m} \ln \frac{h_i(x) - p_i}{M_i}$        (5.5) 


for every $x$ satisfying $h_i(x) > p_i$, $i = 0, \ldots, m$ and for every fixed $s > 0$, where 
$M_i$ is the maximum of $h_i(x) - p_i$ on the set of feasible solutions. Take a positive 
sequence $s^1 > s^2 > \ldots$ with the property that $\lim_{k \to \infty} s^k = 0$ and minimize the 
function $T(x, s^k)$ for every fixed $s^k$. As the set of feasible solutions is compact, 
the minimum of $T(x, s^k)$ exists. Let $x^k$ be an optimal solution to this 
problem. Then we have the relation 


$\lim_{k \to \infty} T(x^k, s^k) = \lim_{k \to \infty} h(x^k) = \min_{x \in D} h(x)$,        (5.6) 


where $D$ denotes the set of feasible solutions. It is remarkable that under 
the mentioned assumptions the function $T(x, s)$ is a convex function for every 
fixed $s$, thus various unconstrained optimization techniques work effectively. 
Computing the values and the gradients of $h_0$ remains a difficult problem, to which 
we return later. In practice the sequence $s^1, s^2, \ldots$ is chosen as a geometric 
sequence and the procedure frequently stops after a small number of steps. 
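
The scheme can be illustrated on a one-dimensional toy problem. The Python sketch below minimizes $h(x) = x$ subject to $F(x) \ge p$, where $F$ is the logistic distribution function (its density is logarithmically concave), using a logarithmic penalty of the form $h(x) - s \ln((F(x) - p)/M_0)$ and a geometric sequence of penalty parameters. Everything specific here is an assumption made for illustration: the toy problem, the artificial upper bound `x_max` that makes the feasible set compact, the golden-section inner solver, and all function names. In this toy case $F$ is available in closed form, whereas in the problems of this chapter it would have to be estimated.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def golden_section_min(f, lo, hi, tol=1e-10):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:                      # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:                            # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)


def sumt(p=0.9, x_max=10.0, s0=1.0, ratio=0.1, n_outer=7):
    """Log-penalty iterations for: minimize x subject to logistic_cdf(x) >= p."""
    x_min = math.log(p / (1.0 - p))      # boundary point where F(x) = p
    M0 = logistic_cdf(x_max) - p         # maximum of F(x) - p on [x_min, x_max]

    def T(x, s):
        slack = logistic_cdf(x) - p
        if slack <= 0.0:
            return math.inf              # penalty undefined outside the interior
        return x - s * math.log(slack / M0)

    s, iterates = s0, []
    for _ in range(n_outer):
        x_k = golden_section_min(lambda x: T(x, s), x_min, x_max)
        iterates.append((s, x_k))
        s *= ratio                       # geometric sequence s^1 > s^2 > ...
    return iterates


for s_k, x_k in sumt():
    print(s_k, x_k)                      # x_k decreases towards log(9)
```

The constrained optimum of the toy problem is the boundary point $x = \ln 9$, so the iterates approach the boundary from the interior, which is the situation covered by the limit relation (5.6).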
Below we prove two theorems which help to check properties generally 
required when solving optimization problems by the SUMT method. 


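Theorem 5.2.1. If a function $h$ is logconcave on the convex set given by the 
relation 


$H = \{x \mid h(x) \ge p\}$, 


where $p$ is a fixed probability satisfying the inequality $0 < p < 1$, then the 
function $h(x) - p$ is also logconcave on the set $H$. 


Proof. Let $x, y \in H$, $x \ne y$ and $0 < \lambda < 1$. Then since $h$ is logconcave on $H$ 
we have the inequality 


$h(\lambda x + (1 - \lambda) y) - p \ge [h(x)]^{\lambda} [h(y)]^{1 - \lambda} - p$. 

Setting $h(x) = a$, $h(y) = b$, it will be enough to prove the inequality 

$a^{\lambda} b^{1 - \lambda} - p \ge (a - p)^{\lambda} (b - p)^{1 - \lambda}$. 


Dividing by $a^{\lambda} b^{1 - \lambda}$ on both sides we obtain 


$1 - \left(\frac{p}{a}\right)^{\lambda} \left(\frac{p}{b}\right)^{1 - \lambda} \ge \left(\frac{a - p}{a}\right)^{\lambda} \left(\frac{b - p}{b}\right)^{1 - \lambda}$. 

Now, using the arithmetic mean-geometric mean inequality, we derive 

$\left(\frac{p}{a}\right)^{\lambda} \left(\frac{p}{b}\right)^{1 - \lambda} + \left(\frac{a - p}{a}\right)^{\lambda} \left(\frac{b - p}{b}\right)^{1 - \lambda} \le \lambda \frac{p}{a} + (1 - \lambda) \frac{p}{b} + \lambda \frac{a - p}{a} + (1 - \lambda) \frac{b - p}{b} = 1$. 


This proves the theorem. 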
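
As a quick numerical sanity check of the key inequality in the proof, the Python sketch below evaluates the difference between its two sides on a grid of values $a, b \in [p, 1]$ and $\lambda \in (0, 1)$. The grid, the value $p = 0.3$ and the function name `gap` are arbitrary illustrative choices; the check does not replace the proof.

```python
def gap(a, b, p, lam):
    """Left side minus right side of
    a^lam * b^(1-lam) - p >= (a-p)^lam * (b-p)^(1-lam)."""
    return a ** lam * b ** (1 - lam) - p - (a - p) ** lam * (b - p) ** (1 - lam)


p = 0.3
grid = [p + 0.05 * k for k in range(0, 15)]      # values of a and b in [p, 1]
lams = [0.1 * k for k in range(1, 10)]           # values of lambda in (0, 1)
worst = min(gap(a, b, p, lam) for a in grid for b in grid for lam in lams)
print(worst >= -1e-12)   # True: the gap is nonnegative up to rounding
```

Equality is attained when $a = b$, which is why the smallest gap on the grid is zero up to rounding error.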

Theorem 5.2.1 shows that under the conditions introduced in the beginning 
of this section the function $T(x, s)$ is convex for every fixed $s > 0$ on the set of 
vectors $x$ satisfying the inequalities $h_i(x) > p_i$, $i = 0, 1, \ldots, m$. 


Theorem 5.2.2. Suppose that in problem (5.1) the assumptions introduced 
in Section 5.2 hold and let $x$ be a nonboundary point of the set of feasible 
solutions. Then we have 


$h_i(x) > p_i, \quad i = 0, 1, \ldots, m$. 


Proof. By the assumptions introduced in the beginning of this section there 
exists a $z$ satisfying the inequalities 


$h_i(z) > p_i, \quad i = 0, 1, \ldots, m$. 

We may assume that $x \ne z$. For some $\mu > 1$ the point 

$y = z + \mu (x - z)$ 


is a boundary point of the set of feasible solutions. Using the notation $\lambda = 1/\mu$ 
we obtain 


$x = \lambda y + (1 - \lambda) z$. 


By the logconcavity of the constraining functions and taking into account the 
inequalities $p_i > 0$, $i = 0, 1, \ldots, m$, we obtain 


$h_i(x) = h_i(\lambda y + (1 - \lambda) z) \ge [h_i(y)]^{\lambda} [h_i(z)]^{1 - \lambda} \ge p_i^{\lambda} [h_i(z)]^{1 - \lambda} > p_i^{\lambda} p_i^{1 - \lambda} = p_i$, 

$i = 0, 1, \ldots, m$. 

This proves the theorem. 

Theorem 5.2.2 states that at every nonboundary feasible solution of problem 
(5.1) the penalty function (5.5) is defined, and this makes possible the proof 
of the limit relation (5.6) also in the case when the optimal solution is on the 
boundary of the set of feasible solutions. 

Finally we remark that the application of the SUMT method is particularly 
advantageous in cases when the calculation of the gradients of $h_0$ (and possibly 
also of $h_i$, $i = 1, \ldots, m$) would be complicated not so much because of the 
probabilistic nature of $h_0$ but because of the special structure of the functions 
$g_1, \ldots, g_r$. In such cases gradient-free techniques may be applied to minimize 
$T(x, s)$. 


5.3 Solution by the Method of Feasible Directions 
The following assumptions are introduced: 


• the probabilistic constraint has the form (5.3), 
• $h$ is convex and has continuous gradient in $R^n$, 
• $h_1, \ldots, h_m$ are quasi-concave and have continuous gradients in $R^n$, 
• the constraints in which the constraining functions are linear determine a 
bounded set, 
• there exists an $x$ satisfying $h_i(x) > p_i$, $i = 0, \ldots, m$, 
• $\xi$ has a continuous probability distribution with logarithmically concave 
density. 

The method uses successive linearization of the constraints and the objective 
function. We start from an arbitrary feasible vector $x^1$ and if $x^1, \ldots, x^k$ 
have already been determined then first we solve the following direction finding problem: 


minimize    $y$ 
subject to  $\nabla h(x^k)(x - x^k) \le y$, 
            $h_i(x^k) + \nabla h_i(x^k)(x - x^k) \ge p_i$,        (5.7) 
            $\nabla h_i(x^k)(x - x^k) + \theta_i y \ge 0$, if $h_i(x^k) = p_i$ 
            and $h_i$ is a nonlinear function, $i = 0, 1, \ldots, m$, 

where the $\theta_i$ are fixed positive numbers not depending on the individual 
problems (5.7). If $x_k^*$ is an optimal solution of problem (5.7) then we solve the 
following step length finding problem: 


$\min_{\lambda} \; h(x^k + \lambda (x_k^* - x^k))$,        (5.8) 


where the minimization is extended over those values of $\lambda$ for which 
$x^k + \lambda (x_k^* - x^k)$ is feasible. If $\lambda^k$ is an optimal solution of problem (5.8) 
then we define 


$x^{k+1} = x^k + \lambda^k (x_k^* - x^k)$. 

Under the assumptions introduced in the beginning of this section the following 
limit relation holds 

$\lim_{k \to \infty} h(x^k) = \min_{x \in D} h(x)$.        (5.9) 
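
The step length problem (5.8) is a one-dimensional search restricted to the feasible part of a line segment. The Python sketch below shows one simple way to carry it out, assuming that the admissible values of $\lambda$ are sought in $[0, 1]$, that they form an interval containing 0 (as is the case for a convex feasible set), and that feasibility can be tested by a user-supplied predicate. The function names, the bisection and golden-section routines, and the toy data are illustrative assumptions, not part of the published method.

```python
import math


def largest_feasible_step(is_feasible, x, d, lam_hi=1.0, tol=1e-10):
    """Largest lam in [0, lam_hi] such that x + lam * d is feasible."""
    def point(lam):
        return [x_i + lam * d_i for x_i, d_i in zip(x, d)]

    if is_feasible(point(lam_hi)):
        return lam_hi
    lo, hi = 0.0, lam_hi                 # invariant: lo feasible, hi infeasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(point(mid)):
            lo = mid
        else:
            hi = mid
    return lo


def step_length(h, is_feasible, x, x_star, tol=1e-10):
    """Minimize h along x + lam * (x_star - x) over the feasible lam in [0, 1]."""
    d = [s_i - x_i for s_i, x_i in zip(x_star, x)]
    lo, hi = 0.0, largest_feasible_step(is_feasible, x, d, tol=tol)

    def phi(lam):
        return h([x_i + lam * d_i for x_i, d_i in zip(x, d)])

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    e = lo + inv_phi * (hi - lo)
    fc, fe = phi(c), phi(e)
    while hi - lo > tol:                 # golden-section search for convex phi
        if fc < fe:
            hi, e, fe = e, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = phi(c)
        else:
            lo, c, fc = c, e, fe
            e = lo + inv_phi * (hi - lo)
            fe = phi(e)
    return 0.5 * (lo + hi)


# Toy data: convex objective, unit disk as feasible set.
h = lambda x: (x[0] - 0.2) ** 2 + x[1] ** 2
in_disk = lambda x: x[0] ** 2 + x[1] ** 2 <= 1.0
print(step_length(h, in_disk, [0.0, 0.0], [0.5, 0.0]))   # close to 0.4
```

When the function values come from Monte Carlo simulation, both the feasibility test and the comparisons inside the search are noisy, so in practice the tolerances would have to be matched to the sampling accuracy.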


The above procedure was published by Zoutendijk [16]. The convergence 
proof under the mentioned conditions is presented in [5]. Of particular interest 
is the case where all constraining functions but $h_0$ are linear. Writing 
$h_i(x) = a_i' x$, $i = 1, \ldots, m$ and $h(x) = c'x$, the problem is to 


minimize    $c'x$ 
subject to  $P(Tx \ge \xi) \ge p$,        (5.10) 
            $a_i' x \ge p_i, \quad i = 1, \ldots, m$. 

The first phase problem, which is to find a feasible solution to (5.10), is the following: 


maximize    $P(Tx \ge \xi)$ 
subject to  $a_i' x \ge p_i, \quad i = 1, \ldots, m$.        (5.11) 

When maximizing the objective function in problem (5.11) we can stop the 
procedure whenever we reach an $x$ satisfying 


$P(Tx \ge \xi) \ge p$.        (5.12) 


On the other hand, if we continue until the inequality (5.12) holds strictly, 
we have numerical evidence that the regularity condition (the second to the last 
condition) holds true. 

If the probability $P(Tx \ge \xi)$ is positive in the set of feasible solutions 
then we take its negative logarithm and minimize this rather than maximize 
the original probability. Thus the new problem, equivalent to problem (5.11), 
is the following 


minimize    $-\log P(Tx \ge \xi)$ 
subject to  $a_i' x \ge p_i, \quad i = 1, \ldots, m$.        (5.13) 

The gradient of the objective function in problem (5.13) can be computed on 
the basis of the equality 


$\nabla \left( -\log P(Tx \ge \xi) \right) = -\frac{1}{P(Tx \ge \xi)} \nabla P(Tx \ge \xi)$. 


The method of feasible directions is considered today a slow method for 
solving nonlinear programming problems. Taking into account the special aspects 
of probabilistic constrained programming problems, however, we need not be so 
dissatisfied with its performance. Problems (5.7) and (5.8) clearly show how 
accurately we have to compute the function values and the gradient values in 
order to obtain good approximations. 


5.4 Solution by the Supporting Hyperplane Method 
We introduce the following assumptions: 

• there exists a bounded convex polyhedron $K^1$ such that the set of feasible 
solutions is contained in $K^1$, 
• the functions $-h, h_1, \ldots, h_m$ are quasi-concave and have continuous 
gradients on $K^1$, 
• there exists an $x$ such that $h_i(x) > p_i$, $i = 0, \ldots, m$, 
• $\xi$ has a continuous probability distribution and logconcave density in $R^q$; 
furthermore $h_0$ has continuous gradient in $R^n$. 

We assume that we have an initial feasible vector $x^1$. Then we perform 
successive iterations, where the $k$-th iteration consists of two steps. 


Step 1. Solve the problem 


minimize    $h(x)$ 
subject to  $x \in K^k$, 


where $K^k$ is a convex polyhedron. Let $x^k$ be an optimal solution to this problem. 
If $h_i(x^k) \ge p_i$, $i = 0, \ldots, m$ then $x^k$ is an optimal solution to problem (5.1). 
Otherwise go to Step 2. 

Step 2. Let $\lambda^k$ be the largest $\lambda$ ($0 \le \lambda \le 1$) for which the following inequality 
holds 

$h_i(x^1 + \lambda (x^k - x^1)) \ge p_i, \quad i = 0, \ldots, m$. 


Various one-dimensional methods can be applied to solve this problem. Let 

$y^k = x^1 + \lambda^k (x^k - x^1)$. 


If $h(y^k) - h(x^k) \le \varepsilon$, where $\varepsilon$ is a previously chosen small positive number, then 
we stop and accept $y^k$ as an approximate solution to the optimization problem. 
Otherwise choose a subscript $i_k$ for which $h_{i_k}(y^k) = p_{i_k}$ and define 


$K^{k+1} = \{x \mid x \in K^k, \; \nabla h_{i_k}(y^k)(x - y^k) \ge 0\}$ 


and go to Step 1 using $k + 1$ instead of $k$. Under the mentioned assumptions 
the procedure is convergent in the sense that 


$\lim_{k \to \infty} h(x^k) = \min_{x \in D} h(x)$. 
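
Step 2 and the construction of the new cut can be sketched in a few lines of Python. The sketch assumes exact function values, a bisection search for $\lambda^k$, and a single toy constraint $h_0(x) = \exp(-(x_1^2 + x_2^2)) \ge e^{-1}$ whose feasible set is the unit disk; the helper names `boundary_point` and `cut` are hypothetical and the linear program of Step 1 is not included.

```python
import math


def boundary_point(constraints, x1, xk, tol=1e-12):
    """Step 2: largest lam in [0, 1] with h_i(x1 + lam (xk - x1)) >= p_i for all i.

    `constraints` is a list of (h_i, p_i) pairs; x1 is assumed feasible.
    Returns lam and the corresponding point y.
    """
    def point(lam):
        return [a + lam * (b - a) for a, b in zip(x1, xk)]

    def feasible(x):
        return all(h_i(x) >= p_i for h_i, p_i in constraints)

    if feasible(point(1.0)):
        return 1.0, point(1.0)
    lo, hi = 0.0, 1.0                     # invariant: lo feasible, hi infeasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(point(mid)):
            lo = mid
        else:
            hi = mid
    return lo, point(lo)


def cut(grad_h, y):
    """Coefficients (g, rhs) of the cut g . x >= rhs, i.e. grad h(y) (x - y) >= 0."""
    g = grad_h(y)
    return g, sum(g_i * y_i for g_i, y_i in zip(g, y))


# Toy constraint (an assumption for illustration): feasible set is the unit disk.
h0 = lambda x: math.exp(-(x[0] ** 2 + x[1] ** 2))
grad_h0 = lambda x: [-2.0 * x[0] * h0(x), -2.0 * x[1] * h0(x)]
p0 = math.exp(-1.0)

lam, y = boundary_point([(h0, p0)], [0.0, 0.0], [2.0, 0.0])
g, rhs = cut(grad_h0, y)
print(lam, y, g, rhs)    # lam is about 0.5, y is about (1, 0); the cut reads x_1 <= 1
```

In the toy example the cut is the tangent line to the disk at $y^k$, so it removes the infeasible point $x^k$ while keeping the whole feasible set.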


This method was published in [14] and applied to solve probabilistic constrained 
programming problems in [9]. 


5.5 Solution by a Variant of the General Reduced Gradient Method 


A variant of the GRG method [1], suitably adapted to problem (5.1) where 
the stochastic constraint reduces to (5.2) and the other constraints are linear, 
has been reported in [4]. It differs from the GRG method primarily in the 
formulation of the direction finding problem. Here we always generate feasible 
solutions and thus avoid the application of intermediate methods to return 
to the feasible set, which is very important because our function values are noisy. 
The problem to be solved is now formulated in the following form: 


minimize    $h(x)$ 
subject to  $h_0(x) = P(Tx \ge \xi) \ge p$, 
            $Ax = b$,        (5.14) 
            $x \ge 0$. 


Concerning this problem the following assumptions are introduced: 

- the random variable $\xi$ has a continuous probability distribution with 
logconcave density function, 

- $\nabla h_0(x)$ is Lipschitz-continuous and bounded in $R^n$, 

- there exists a feasible $x$ such that $h_0(x) > p$, 

- the $m \times n$ matrix $A$ has rank equal to $m$ and for every feasible $x$ there 
exists a basis $B$ such that $x_i > 0$ for $i \in I_B$, where $I_B$ is the set of subscripts 
of the basis vectors. 

We start from a feasible solution $x$ to problem (5.14) and assume that a 
basis $B$ of the columns of $A$ can be found which, for the sake of simplicity, is 
assumed to consist of the first $m$ columns of $A$, with the property that when 
applying the partition $A = (B, C)$ and the corresponding partition 
$x' = (w', z')$ of $x$, all components of $w$ are strictly positive. We will have a 
direction finding problem and a step length determination problem. 

Direction finding problem. First we formulate the following problem 


minimize    $y$ 
subject to  $\nabla_w h(x) u + \nabla_z h(x) v \le y$, 
            $\nabla_w h_0(x) u + \nabla_z h_0(x) v + \theta y \ge 0$, if $h_0(x) = p$,        (5.15) 
            $Bu + Cv = 0$, 
            $v_i \ge 0$, if $z_i = 0$, $i = 1, \ldots, n - m$, $\|v\| \le 1$. 


Here $\theta > 0$ is a fixed number and the partition $t' = (u', v')$ corresponds to the 
partition $x' = (w', z')$. Introducing the row vectors 


$r = \nabla_z h(x) - \nabla_w h(x) B^{-1} C$, 
$s = \nabla_z h_0(x) - \nabla_w h_0(x) B^{-1} C$, 

which are called reduced gradients, problem (5.15) can be rewritten in the 
following manner 


minimize    $y$ 
subject to  $rv \le y$, 
            $sv + \theta y \ge 0$, if $h_0(x) = p$,        (5.16) 
            $v_i \ge 0$, if $z_i = 0$, $i = 1, \ldots, n - m$, 
            $\|v\| \le 1$. 


It can easily be proved that the optimum value of (5.16) is equal to zero if and 
only if $x$ is a Kuhn-Tucker point. If this is not the case then the optimal value 
of problem (5.16) is negative and if $v^*$, $y^*$ is an optimal solution of this problem, 
furthermore $u^* = -B^{-1} C v^*$, then 


$t^* = \begin{pmatrix} u^* \\ v^* \end{pmatrix}$ 

is a feasible direction along which the function $h$ is strictly locally 
decreasing. 

If the norm $\|v\|$ is chosen in the following manner: $\|v\| = \max_i |v_i|$, then 
problem (5.16) becomes a two row LP with individual lower resp. upper bounds 
which can easily be handled. Here we are able to take into account the 
inaccuracy in the evaluation of $\nabla h_0$. The accuracy can be increased by taking a 
larger sample in the Monte Carlo evaluation. We remark that when updating 
the reduced gradients standard LP technique can be used. 
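
When the probabilistic constraint is inactive ($h_0(x) > p$), the second row of (5.16) is absent and, with the maximum norm, the problem can be solved by inspection: each $v_i$ is set to $+1$ if $r_i < 0$, to $-1$ if $r_i > 0$ and $z_i > 0$, and to $0$ otherwise. The Python sketch below computes the reduced gradient $r$ and this direction for a tiny example. It covers only this inactive case; the dense Gaussian elimination, the function names and the data are assumptions made for illustration and are not the implementation reported in [4].

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for row in range(col + 1, n):
            f = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def reduced_gradient(B, C, grad_w, grad_z):
    """r = grad_z - grad_w B^{-1} C, computed via pi B = grad_w."""
    m = len(B)
    B_transposed = [[B[i][j] for i in range(m)] for j in range(m)]
    pi = solve(B_transposed, grad_w)
    return [grad_z[j] - sum(pi[i] * C[i][j] for i in range(m))
            for j in range(len(grad_z))]


def direction_inactive_case(B, C, r, z):
    """Solve (5.16) with the max norm when h_0(x) > p (no probabilistic row)."""
    v = []
    for r_i, z_i in zip(r, z):
        if r_i < 0.0:
            v.append(1.0)                 # increasing z_i decreases h
        elif r_i > 0.0 and z_i > 0.0:
            v.append(-1.0)                # decreasing z_i is allowed and helps
        else:
            v.append(0.0)                 # z_i = 0 cannot be decreased
    Cv = [sum(C[i][j] * v[j] for j in range(len(v))) for i in range(len(C))]
    u = solve(B, [-c_i for c_i in Cv])    # u = -B^{-1} C v keeps A t = 0
    y = sum(r_i * v_i for r_i, v_i in zip(r, v))
    return u, v, y


# Tiny example: A = (B, C) = (1 | 1 1), x = (w; z) = (0.5; 0.5, 0), h(x) = c'x.
B = [[1.0]]
C = [[1.0, 1.0]]
r = reduced_gradient(B, C, grad_w=[1.0], grad_z=[2.0, 0.5])
u, v, y = direction_inactive_case(B, C, r, z=[0.5, 0.0])
print(r, u, v, y)
```

In the example the optimal value $y$ is negative, so the point is not a Kuhn-Tucker point and $t^* = (u^*, v^*)$ is a feasible descent direction. When the probabilistic constraint is active, the extra row involving $s$ and $\theta$ has to be handled by an LP solver.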


Step length determination. Starting from the interval allowed by the 
nonnegativity restrictions we apply a line search technique to find a point for 
which the nonlinear restriction holds with equality. Then we minimize the 
objective function on the line segment between $x$ and this point. In this 
one-dimensional optimization we optimize with respect to $\lambda$, i.e. we solve the 
problem 

$\min_{\lambda} \; h(x + \lambda t^*)$. 


If its optimal solution is $\lambda^*$ then the new feasible solution will be 


$x^{(1)} = \begin{pmatrix} w^{(1)} \\ z^{(1)} \end{pmatrix} = \begin{pmatrix} w \\ z \end{pmatrix} + \lambda^* \begin{pmatrix} u^* \\ v^* \end{pmatrix}$, 

provided all components of $w^{(1)}$ are strictly positive. Otherwise by applying 
subsequent pivoting we find a basis $B^{(1)}$ with the property that the corresponding 
components of $x^{(1)}$ are already strictly positive. 
For the sake of simplicity, we did not include in the algorithm all the 
technicalities ensuring convergence. The paper [4] already referred to gives a full 
description of these. 


5.6 Solution by a Primal-dual Type Algorithm 
The problem to be solved has the following form: 


minimize    $c'x$ 
subject to  $F(y) \ge p$,        (5.17) 
            $Tx \ge y$, $Bx \ge d$, 


where $x \in R^n$ and $y \in R^r$. We assume that the multivariate probability 
distribution function $F$ is strictly logarithmically concave and has continuous 
gradient in $R^r$. We will briefly describe the method proposed in [3]. 

To this problem we assign a problem that we will call the dual problem, 
although it is not dual in the classical sense. This dual is the following: 


maximize    $\left[ \min_{F(y) \ge p} u'y + v'd \right]$ 
subject to  $T'u + B'v = c$,        (5.18) 
            $u \ge 0$, $v \ge 0$. 


The procedure works in the following manner. First we assume that a pair 
of vectors $(u^1, v^1)$ is available for which 


$(u^1, v^1) \in V = \{(u, v) \mid T'u + B'v = c, \; v \ge 0\}$. 


Suppose that $(u^k, v^k)$ has already been chosen, where $u^k > 0$. Then the following 
steps have to be performed. 


Step 1. Solve the problem 


minimize    $(u^k)' y$ 
subject to  $F(y) \ge p$. 


Let $y(u^k)$ denote the optimal solution to this problem. Then we solve the 
following direction finding problem 


maximize    $[u' y(u^k) + d'v]$ 
subject to  $(u, v) \in V$. 


Let $(u_k^*, v_k^*)$ be an optimal solution to this problem. If $u_k^* = \rho u^k$ then $(u_k^*, v_k^*)$ 
is an optimal solution of the dual problem and the pair $\hat{x}$, $y(u^k)$ is an optimal 
solution of the primal problem, where $\hat{x}$ is an optimal solution of the linear 
programming problem: 


minimize    $c'x$ 
subject to  $Tx \ge y(u^k)$, $Bx \ge d$. 

Otherwise go to Step 2. 

Step 2. Find $\lambda^k$ ($0 < \lambda^k < 1$) satisfying 


a AF Etat) >uly(ut) tod 
ky T-)F" Up um ylu vu” d. 


Then we define 

$u^{k+1} = \lambda^k u^k + (1 - \lambda^k) u_k^*$, 

$v^{k+1} = \lambda^k v^k + (1 - \lambda^k) v_k^*$. 


If the procedure is infinite then it can be proved that the sequence $(u^k, v^k)$ 
converges and the limiting pair has the same property as $(u_k^*, v_k^*)$ in Step 1. 


5.7 The Polynomial Distribution 

A special multivariate probability distribution has been introduced by the 
author to approximate the distribution of $\xi$. This is defined on the unit cube of 
the $n$-dimensional space by its probability distribution function as follows: 


$F(z_1, \ldots, z_n) = \dfrac{1}{\sum_{i=1}^{N} c_i z_1^{a_{i1}} \cdots z_n^{a_{in}}}$, if $0 < z_i \le 1$, $i = 1, \ldots, n$,        (5.19) 

and $F(z_1, \ldots, z_n)$ is suitably defined otherwise. Here $a_{i1} \le 0, \ldots, a_{in} \le 0$, 
$a_{i1} + \ldots + a_{in} < 0$, $i = 1, \ldots, N$ and $c_1 > 0, \ldots, c_N > 0$; furthermore these are 
constants. 

If a mathematical programming problem has the form of a geometric 
programming problem and in addition a probabilistic constraint of the type 
$F(z) \ge p$ is included, where $F(z)$ is of the above type, then the new problem 
is again a geometric programming problem for which methods of solution are 
available. 

We will not consider the algorithmic solution of problems of this type in 
detail. Our purpose here is to show that under certain conditions the function 
(5.19) will in fact be a probability distribution function. To illustrate the 
situation we restrict ourselves to the case of $n = 2$. 


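Theorem 5.7.1. If the following conditions hold: 

$a_{11} \le a_{21} \le \ldots \le a_{N1}$, 
$a_{12} \ge a_{22} \ge \ldots \ge a_{N2}$,        (5.20) 

then the function (5.19) is a probability distribution function in the unit square 
$0 < z_1, z_2 \le 1$. 

Proof. The only property that we need to show is that 


$\dfrac{\partial^2 F(z_1, z_2)}{\partial z_1 \partial z_2} \ge 0$, if $0 < z_1, z_2 < 1$.        (5.21) 

The other properties of a two-dimensional probability distribution are satisfied. 
Introducing the notation: 


$\Sigma = \sum_{i=1}^{N} c_i z_1^{a_{i1}} z_2^{a_{i2}}$, 

the function $F$ can be written as $F = 1/\Sigma$. By differentiating we obtain 


$\dfrac{\partial^2 F(z_1, z_2)}{\partial z_1 \partial z_2} = \dfrac{2}{\Sigma^3} \dfrac{\partial \Sigma}{\partial z_1} \dfrac{\partial \Sigma}{\partial z_2} - \dfrac{1}{\Sigma^2} \dfrac{\partial^2 \Sigma}{\partial z_1 \partial z_2}$. 

The requirement that this be non-negative is equivalent to the following 
inequality: 

$2 \dfrac{\partial \Sigma}{\partial z_1} \dfrac{\partial \Sigma}{\partial z_2} \ge \Sigma \dfrac{\partial^2 \Sigma}{\partial z_1 \partial z_2}$, 


or in a more detailed form: 


$2 \left( \sum_{i=1}^{N} c_i a_{i1} z_1^{a_{i1} - 1} z_2^{a_{i2}} \right) \left( \sum_{j=1}^{N} c_j a_{j2} z_1^{a_{j1}} z_2^{a_{j2} - 1} \right) \ge \left( \sum_{i=1}^{N} c_i z_1^{a_{i1}} z_2^{a_{i2}} \right) \left( \sum_{j=1}^{N} c_j a_{j1} a_{j2} z_1^{a_{j1} - 1} z_2^{a_{j2} - 1} \right)$.        (5.22) 


Multiplying by $z_1 z_2$ on both sides in (5.22) we get the equivalent inequality 


$2 \left( \sum_{i=1}^{N} c_i a_{i1} z_1^{a_{i1}} z_2^{a_{i2}} \right) \left( \sum_{j=1}^{N} c_j a_{j2} z_1^{a_{j1}} z_2^{a_{j2}} \right) \ge \left( \sum_{i=1}^{N} c_i z_1^{a_{i1}} z_2^{a_{i2}} \right) \left( \sum_{j=1}^{N} c_j a_{j1} a_{j2} z_1^{a_{j1}} z_2^{a_{j2}} \right)$.        (5.23) 


Let $\lambda_i = c_i z_1^{a_{i1}} z_2^{a_{i2}} / \Sigma$, $i = 1, \ldots, N$. Then (5.23) is equivalent to 


$2 \sum_{i=1}^{N} a_{i1} \lambda_i \sum_{j=1}^{N} a_{j2} \lambda_j \ge \sum_{i=1}^{N} a_{i1} a_{i2} \lambda_i$.        (5.24) 


Since 


$\lambda_i > 0, \quad i = 1, \ldots, N, \quad \lambda_1 + \ldots + \lambda_N = 1$, 

the expression 

$\sum_{i=1}^{N} a_{i1} a_{i2} \lambda_i - \sum_{i=1}^{N} a_{i1} \lambda_i \sum_{j=1}^{N} a_{j2} \lambda_j$        (5.25) 

is the covariance of the two sequences 


$a_{11}, a_{21}, \ldots, a_{N1}$, 

$a_{12}, a_{22}, \ldots, a_{N2}$, 


where to the corresponding pairs we assign the probabilities $\lambda_1, \lambda_2, \ldots, \lambda_N$, 
respectively. Assumption (5.20) implies that the covariance (5.25) is nonpositive 
(as can be seen very easily). Since all the $a_{ij}$ are nonpositive, the product of the 
two means $\sum_i a_{i1} \lambda_i$ and $\sum_j a_{j2} \lambda_j$ is nonnegative. Hence (5.24) holds true, 
which is the same as (5.21), and the theorem is proved. 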
The following theorem is useful when considering probabilistic constraints
of the form
$$F(z_1, \dots, z_n) \ge p, \qquad 0 < z_i \le 1, \quad i = 1, \dots, n, \tag{5.26}$$
where $0 < p < 1$ is a fixed probability.


Theorem 5.7.2. The function $F(z_1, \dots, z_n)$ is logconcave in the unit cube
$0 < z_1, \dots, z_n \le 1$.


Proof. A well-known theorem due to Artin states that the sum of logconvex 
functions defined on the same convex set is a logconvex function on the same 
set. 

Since $a_{i1} \le 0, \dots, a_{in} \le 0$, $i = 1, \dots, N$, it follows that each term
$$c_i z_1^{a_{i1}} \cdots z_n^{a_{in}}$$
is a logconvex function in the unit cube, hence the same holds for their sum
which is equal to $\Sigma$. Now $F = 1/\Sigma$ and this implies that $F$ is a logconcave
function in the n-dimensional unit cube. This proves the theorem.

Theorem 5.7.2 shows that the set of n-tuples $z_1, \dots, z_n$ determined by the
inequality (5.26) is a convex set for every fixed probability $p$.
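
The logconcavity claim can be checked numerically. The following is only a sketch: it assumes that (5.19) has the form $F(z) = 1/\sum_i c_i z_1^{a_{i1}} \cdots z_n^{a_{in}}$ with positive $c_i$ and nonpositive exponents, and the coefficients, the sampling range and the names `F` and `midpoint_logconcave` are illustrative choices, not taken from the text. It tests the midpoint inequality $\log F((u+v)/2) \ge \tfrac12(\log F(u) + \log F(v))$ on random pairs of points.

```python
import math
import random


def F(z, c, a):
    """Assumed form of (5.19): F(z) = 1 / sum_i c_i * prod_j z_j ** a[i][j]."""
    total = 0.0
    for c_i, a_i in zip(c, a):
        term = c_i
        for z_j, a_ij in zip(z, a_i):
            term *= z_j ** a_ij
        total += term
    return 1.0 / total


def midpoint_logconcave(c, a, n, trials=1000, seed=0):
    """Return True if log F passes the midpoint concavity test on random pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.uniform(0.05, 1.0) for _ in range(n)]
        v = [rng.uniform(0.05, 1.0) for _ in range(n)]
        mid = [(u_j + v_j) / 2 for u_j, v_j in zip(u, v)]
        lhs = math.log(F(mid, c, a))
        rhs = 0.5 * (math.log(F(u, c, a)) + math.log(F(v, c, a)))
        if lhs < rhs - 1e-12:  # small tolerance for rounding
            return False
    return True


# Illustrative data: positive weights summing to 1, nonpositive exponents.
c = [0.5, 0.5]
a = [[-1.0, -2.0], [-3.0, -0.5]]
print(midpoint_logconcave(c, a, n=2))
```

A passing test does not prove logconcavity; it only fails to find a counterexample on the sampled points.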


5.8 Calculation of Function Values and Gradients 


In this section we consider the problem of how to compute the gradient of the
function $F(Tx)$. It turns out that many special probability distributions allow
the computation of the gradient of $F(Tx)$, as we illustrate in two special
cases: the multivariate normal distribution and a special type of
multivariate gamma distribution.

Under suitable differentiability assumptions the following equality holds
true in all cases:
$$\frac{\partial F(z)}{\partial z_i} = F(z_j,\ j = 1, \dots, r,\ j \ne i \mid z_i)\, f_i(z_i), \quad i = 1, \dots, r, \tag{5.27}$$
where $F(\,\cdot \mid z_i)$ is the conditional probability distribution function of the remaining
components given $\xi_i = z_i$ and $f_i$ is the probability density function of the random variable $\xi_i$.

Let us first consider the case of the multivariate normal distribution. It will
be convenient to assume that the joint distribution of the variables $\xi_1, \dots, \xi_r$ is
nondegenerate, furthermore $E(\xi_i) = 0$, $E(\xi_i^2) = 1$, $i = 1, \dots, r$. Then the joint
probability distribution function is $\Phi(z; R)$ where $R$ is the correlation matrix.
It is well-known that
$$\frac{\partial \Phi(z; R)}{\partial z_i} = \Phi\!\left( \frac{z_j - r_{ji} z_i}{\sqrt{1 - r_{ji}^2}},\ j = 1, \dots, r,\ j \ne i;\ R_i \right) \varphi(z_i), \tag{5.28}$$
where $R_i$ is the $(r-1) \times (r-1)$ correlation matrix consisting of the correlations
$$s_{jk} = \frac{r_{jk} - r_{ji} r_{ki}}{\sqrt{1 - r_{ji}^2}\, \sqrt{1 - r_{ki}^2}}, \quad j, k = 1, \dots, r, \quad j, k \ne i, \tag{5.29}$$
and $\varphi$ is the one-dimensional standard normal probability density function. It
turns out that the gradient of $\Phi(z; R)$ can be computed in a similar way as
the function value $\Phi(z; R)$. The same subroutine can be used in the $(r-1)$- and
$r$-dimensional cases, respectively.
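
For $r = 2$ formula (5.28) reduces to a product of a one-dimensional normal distribution function and a density, which makes it easy to check. The sketch below is illustrative only: the function names are hypothetical, and the bivariate distribution function is evaluated by a plain trapezoidal rule (an assumption made to keep the example free of third-party libraries, not the method used in the text).

```python
import math


def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def bivariate_cdf(z1, z2, rho, steps=4000, lower=-8.0):
    """P(xi1 <= z1, xi2 <= z2) for standard normal marginals with correlation rho.

    Uses P = integral_{-inf}^{z1} phi(t) * Phi((z2 - rho*t) / sqrt(1 - rho^2)) dt,
    truncated at `lower` and approximated by the trapezoidal rule.
    """
    if z1 <= lower:
        return 0.0
    scale = math.sqrt(1.0 - rho * rho)
    h = (z1 - lower) / steps
    total = 0.0
    for k in range(steps + 1):
        t = lower + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * phi(t) * Phi((z2 - rho * t) / scale)
    return total * h


def bivariate_gradient(z1, z2, rho):
    """Gradient of the bivariate normal distribution function via (5.28), r = 2."""
    scale = math.sqrt(1.0 - rho * rho)
    d1 = Phi((z2 - rho * z1) / scale) * phi(z1)
    d2 = Phi((z1 - rho * z2) / scale) * phi(z2)
    return d1, d2


# Compare the closed-form gradient with central finite differences.
z1, z2, rho, eps = 0.3, -0.5, 0.6, 1e-4
g1, g2 = bivariate_gradient(z1, z2, rho)
fd1 = (bivariate_cdf(z1 + eps, z2, rho) - bivariate_cdf(z1 - eps, z2, rho)) / (2 * eps)
fd2 = (bivariate_cdf(z1, z2 + eps, rho) - bivariate_cdf(z1, z2 - eps, rho)) / (2 * eps)
print(abs(g1 - fd1), abs(g2 - fd2))
```

In higher dimensions the right-hand side of (5.28) needs an $(r-1)$-dimensional normal distribution function, which is why the same subroutine serves for both the value and the gradient.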

The second example is the multivariate gamma distribution introduced in
(8). Suppose that the random vector $\xi$ has the form
$$\xi = A\eta, \tag{5.30}$$
where $A$ is an $r \times (2^r - 1)$ matrix the columns of which are the different nonzero
vectors having 0,1 components and $\eta$ is a $(2^r - 1)$-dimensional random vector with
independent, standard gamma distributed components (some of them may be
equal to 0). Then the conditional probability distribution function in formula
(5.27) can be written in the form


P(é <q, be < 2 (61 = 2) = 
=P(ES +E < 22) E+E < yl = 21) = 





(1) (1) 

=P (2, el”) < 20,0005 21 + €(?) <2z,|6 = 21) = (5.31) 
(1) (1) 

=P (2S + e) < faye ti +60) < z-|1 =*1), 


where ‘ a) 
‘ = by = peaegee? = €, — €) 


and eee (1) are the sums of the joint 7 terms of &,...,€, and &, re- 
spectively. Thus the conditional probability distribution function equals the 
unconditional probability distribution function of the sum z,8+ 7, where ¥ has 
an r — 1-dimensional multigamma distribution of the same type that € has and 
# has similar structure but instead of partial sums of standard gamma variables 
now we use partial sums of components of a random vector having Dirichlet 
distribution. Moreover, § and ¥ are independent. 


5.9 The Case of Discrete Probability Distributions


The following problem will be considered:
$$\begin{aligned} \text{minimize} \quad & c'x \\ \text{subject to} \quad & F(z) \ge p, \\ & Tx \ge z, \quad Bx \ge b, \end{aligned} \tag{5.32}$$
where $F$ is the probability distribution function of the random vector $\xi$. If
$\xi$ has possible values $z_1, \dots, z_N$ such that all positive values of $F$ are among
$F(z_1), \dots, F(z_N)$, then the above problem is equivalent to the following mixed
variable problem:
$$\begin{aligned} \text{minimize} \quad & c'x \\ \text{subject to} \quad & y_1 F(z_1) + \dots + y_N F(z_N) \ge p, \\ & y_1 + \dots + y_N = 1, \quad y_1, \dots, y_N \ge 0 \text{ integers}, \\ & Tx \ge y_1 z_1 + \dots + y_N z_N, \\ & Bx \ge b. \end{aligned} \tag{5.33}$$


Taking a random vector uniformly distributed in the n-dimensional unit
cube and discretizing it by a step length $h$ which is chosen in such a way that
$$1 - nh = p, \tag{5.34}$$
Vizvari [15] proves that the number of lattice points satisfying the probabilistic
constraint is equal to
$$\binom{2n}{n},$$
which is a large number for a large $n$ but small as compared to all lattice points
(of distance $h$) in the unit cube; e.g., if $p = 0.95$ and $n = 5$ then $h = 0.01$. The
total number of lattice points is $101^5$ whereas the number of those which satisfy
the probabilistic constraint is only $\binom{10}{5} = 252$.

Computational experiments show that handling problem (5.32) in the form
of (5.33) provides us with a satisfactory solution methodology if $n$ is not very
large.
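
The lattice-point count for the uniform example can be checked by direct enumeration. This is a sketch under stated assumptions: the distribution function of the uniform vector is taken as $F(z) = z_1 \cdots z_n$ on the unit cube, lattice points are written as $z_i = 1 - k_i h$ with nonnegative integers $k_i$, and the function name is hypothetical. Exact rational arithmetic avoids rounding problems at the boundary $F(z) = p$.

```python
from fractions import Fraction
from itertools import product
from math import comb


def count_feasible_lattice_points(n, h):
    """Count lattice points z = 1 - k*h in the unit cube with prod(z_i) >= 1 - n*h."""
    p = 1 - n * h
    count = 0
    # Every coordinate must itself be >= p, so k_i <= n is enough to scan.
    for k in product(range(n + 1), repeat=n):
        value = Fraction(1)
        for k_i in k:
            value *= 1 - k_i * h
        if value >= p:
            count += 1
    return count


n, h = 5, Fraction(1, 100)          # p = 1 - n*h = 0.95
feasible = count_feasible_lattice_points(n, h)
print(feasible, comb(2 * n, n), 101 ** n)
```

For $n = 5$ and $h = 0.01$ the enumeration agrees with the binomial coefficient $\binom{10}{5}$, a tiny fraction of the $101^5$ lattice points of the cube.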

Another mixed variable formulation will be illustrated in the case when $\xi$
is a two-dimensional random vector the possible values of which are nonnegative
lattice points with coordinates $\le N, M$, respectively. The mixed variable
reformulation of the problem is the following:
$$\begin{aligned}
\text{minimize} \quad & c'x \\
\text{subject to} \quad & p_{00} y_{00} + \dots + p_{N0} y_{N0} + p_{01} y_{01} + \dots + p_{N1} y_{N1} + \dots + p_{0M} y_{0M} + \dots + p_{NM} y_{NM} \ge p, \\
& y_{00} + \dots + y_{N0} = z_1, \\
& y_{00} + y_{01} + \dots + y_{0M} = z_2, \\
& y_{ik} \le y_{i-1,k}, \quad i = 1, \dots, N; \quad k = 0, \dots, M, \\
& y_{ik} \le y_{i,k-1}, \quad i = 0, \dots, N; \quad k = 1, \dots, M, \\
& y_{ik} = 0 \text{ or } 1 \text{ for all } i, k, \text{ and } z_1 \le N,\ z_2 \le M.
\end{aligned}$$

These models can be used in connection with a continuously distributed random
vector $\xi$ too, when approximating its distribution by a discrete distribution. In
the higher dimensional case, however, the number of 0,1 variables becomes too
large.


CHAPTER 6
STOCHASTIC QUASIGRADIENT METHODS


Yu. Ermoliev 


As follows from the brief discussion in Chapter 1, the main purpose
of the stochastic quasigradient (SQG) methods is the solution of optimization
problems whose objective functions and constraints are of a complex nature. For
stochastic programming problems, SQG methods generalize the well-known
stochastic approximation methods for unconstrained optimization of the expectation
of a random function (see for instance Wasan [45]) to problems involving
general constraints and nondifferentiable functions. For deterministic nonlinear
programming problems SQG methods can be regarded as methods of random
search (see for instance [42], [67], [68]).

The purpose of this chapter is to discuss the main directions of development
of SQG procedures and their applications, and to give an overview of the ideas involved
in the proofs. The contents of this chapter are close to those of the paper [69].


6.1 The General Idea 
Consider the problem of minimization:
$$\text{minimize} \quad F^0(x) \tag{6.1}$$
$$\text{subject to} \quad F^i(x) \le 0, \quad i = 1:m, \tag{6.2}$$
$$x \in X \subseteq R^n. \tag{6.3}$$


To start with, let us assume that the functions $F^\nu(x)$, $\nu = 0:m$, are convex.
Then for every $x$ we have the inequality
$$F^\nu(z) - F^\nu(x) \ge \langle F^\nu_x(x), z - x \rangle, \quad \forall z \in X,$$
where $F^\nu_x$ is a subgradient (generalized gradient). We denote by $\partial F^\nu(x)$ the
whole set of subgradients at $x$, the subgradient set. In stochastic quasigradient
methods the sequence of approximations $x^s$, $s = 0, 1, \dots$ is constructed by
using statistical estimates of $F^\nu(x^s)$ and $F^\nu_x(x^s)$: random numbers $\eta_\nu(s)$ and
vectors $\xi^\nu(s)$ which on average are close to $F^\nu(x^s)$, $F^\nu_x(x^s)$. These quantities
are constructed by using information about the past history of the optimization
process, generated by the path $(x^0, \dots, x^s)$ and some other variables, for
instance the Lagrangian multipliers. We denote this history by $B_s$ and for the
sake of simplicity we usually assume that it is $(x^0, \dots, x^s)$. Then for
$\eta_\nu(s)$, $\xi^\nu(s)$ we have the conditional mathematical expectations
$$E\{\eta_\nu(s) \mid x^0, \dots, x^s\} = F^\nu(x^s) + a_\nu(s), \tag{6.4}$$
$$E\{\xi^\nu(s) \mid x^0, \dots, x^s\} = F^\nu_x(x^s) + b^\nu(s), \tag{6.5}$$
where the numbers $a_\nu(s)$ and the vectors $b^\nu(s)$ may depend on $(x^0, \dots, x^s)$.
For exact convergence to an optimal solution, the values $a_\nu(s)$, $\|b^\nu(s)\|$ must
be small (in a certain sense) when $s \to \infty$. At some point we must have that
$$a_\nu(s) \to 0, \quad \|b^\nu(s)\| \to 0 \tag{6.6}$$
directly or in such a way that
$$F^\nu(x^*) - F^\nu(x^s) \ge \langle E\{\xi^\nu(s) \mid x^0, \dots, x^s\}, x^* - x^s \rangle + \gamma_\nu(s), \tag{6.7}$$
where $\gamma_\nu(s) \to 0$ as $s \to \infty$ and $x^*$ is an optimal solution. The vector $\xi^\nu(s)$ is
called a stochastic quasi-gradient when $b^\nu(s) \ne 0$, or a stochastic subgradient,
stochastic generalized gradient (stochastic gradient for a differentiable function
$F^\nu(x)$) when $b^\nu(s) = 0$.

It turns out that for many important classes of optimization problems with
functions $F^\nu(x)$, $\nu = 0:m$, of a complex structure it is much easier to generate
statistical estimates $\eta_\nu(s)$, $\xi^\nu(s)$ than to calculate the exact values $F^\nu(x^s)$ and
the subgradients $F^\nu_x(x^s)$. For stochastic programming problems, when
$$F^\nu(x) = E f^\nu(x, \omega), \quad \nu = 0:m, \tag{6.8}$$
typically one can take $\xi^\nu(s)$ equal to a subgradient (gradient in the differentiable
case) of $f^\nu(\cdot, \omega)$ at $x^s$:
$$\xi^\nu(s) = f^\nu_x(x^s, \omega^s), \tag{6.9}$$
where $\omega^s$ is an observation of $\omega$, since usually, with an appropriate definition of
the subgradient set, we have
$$\partial F^\nu(x) = \int \partial f^\nu(x, \omega)\, P(d\omega).$$
More generally,
$$\xi^\nu(s) = \frac{1}{N_s} \sum_{k=1}^{N_s} f^\nu_x(x^s, \omega^{sk})$$
with a collection of independent samples $\omega^{sk}$, $k = 1:N_s$, $N_s > 0$. Similarly we
can take
$$\eta_\nu(s) = f^\nu(x^s, \omega^s) \tag{6.10}$$
or more generally
$$\eta_\nu(s) = \frac{1}{N_s} \sum_{k=1}^{N_s} f^\nu(x^s, \omega^{sk}),$$
since according to the definition of the functions $F^\nu(x)$
$$F^\nu(x^s) = E\{f^\nu(x^s, \omega) \mid x^s\}.$$
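
The sample-average estimates of the value and of the gradient can be sketched as follows. The integrand $f(x, \omega) = (x - \omega)^2$, the scalar setting and all function names are illustrative assumptions, not part of the text.

```python
def sample_average_estimates(x, omegas, f, f_x):
    """Return (eta, xi): sample averages of f(x, w) and of its gradient f_x(x, w)."""
    n_samples = len(omegas)
    eta = sum(f(x, w) for w in omegas) / n_samples
    xi = sum(f_x(x, w) for w in omegas) / n_samples
    return eta, xi


def f(x, w):
    """Illustrative random function: squared deviation from the observation w."""
    return (x - w) ** 2


def f_x(x, w):
    """Gradient of f with respect to x."""
    return 2.0 * (x - w)


# Two observations of w; with a single observation this reduces to (6.9), (6.10).
eta, xi = sample_average_estimates(2.0, [1.0, 3.0], f, f_x)
print(eta, xi)
```

Both estimates are unbiased here because the observations enter only through averages of $f$ and $f_x$; larger sample sizes reduce their variance at the cost of more function evaluations per iteration.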


We consider different special rules for computing $\xi^\nu(s)$, $\eta_\nu(s)$ in Sections 6.7–6.13.


6.2 Methods for Convex Functions


6.2.1 The Projection Method 


Suppose we have to minimize a convex continuous function $F^0(x)$ in $x \in X \subseteq R^n$,
where $X$ is a closed convex set such that a projection on $X$ can easily
be calculated: $\pi_X(y) = \arg\min\{\|y - x\|^2 : x \in X\}$. For instance, if $X$ is a
hypercube $a \le x \le b$, then $\pi_X(y) = \max[a, \min\{y, b\}]$. Let $X^*$ be a set of
optimal solutions. The method is defined by the relations:
$$x^{s+1} = \pi_X[x^s - \rho_s \xi^0(s)], \quad s = 0, 1, \dots \tag{6.11}$$
$$F^0(x^*) - F^0(x^s) \ge \langle E\{\xi^0(s) \mid x^0, \dots, x^s\}, x^* - x^s \rangle + \gamma_0(s), \tag{6.12}$$
where $\rho_s$ is the step size, $\gamma_0(s)$ may depend on $(x^0, \dots, x^s)$, and $x^* \in X^*$. Let us
notice that if the vector $\xi^0(s)$ satisfies (6.5), then
$$\gamma_0(s) = -\langle b^0(s), x^* - x^s \rangle. \tag{6.13}$$


This method was proposed and studied in [1]–[3], [5]. If $\xi^0(s) = F^0_x(x^s)$, we
obtain the generalized gradient method which was suggested by Shor [34] and
was studied by Ermoliev [35] and Poljak [36]. If $X = R^n$,


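$$F^0(x) = E f^0(x, \omega),$$
$$\xi^0(s) = \sum_{j=1}^{n} \frac{f^0(x^s + \Delta_s e^j, \omega^{sj}) - f^0(x^s, \omega^{s0})}{\Delta_s}\, e^j,$$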
then the method suggested by (6.11) corresponds to the well-known stochastic 
approximation methods which were developed by Robbins and Monro, Kiefer 
and Wolfowitz, Dvoretsky, Blum and others (see [45]). 
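
A minimal sketch of the projection method (6.11) on a hypercube is given below. The test problem (minimizing $E\,\tfrac12\|x - \omega\|^2$ with normally distributed $\omega$), the step size $\rho_s = 1/(s+1)$ and all function names are assumptions made for illustration only.

```python
import random


def project_box(y, a, b):
    """Projection onto the hypercube a <= x <= b: max[a, min{y, b}] componentwise."""
    return [max(a_j, min(y_j, b_j)) for y_j, a_j, b_j in zip(y, a, b)]


def projected_sqg(x0, a, b, stochastic_gradient, steps, seed=0):
    """Iterate x^{s+1} = pi_X[x^s - rho_s * xi(s)] with rho_s = 1 / (s + 1)."""
    rng = random.Random(seed)
    x = project_box(x0, a, b)
    for s in range(steps):
        rho = 1.0 / (s + 1)
        xi = stochastic_gradient(x, rng)
        x = project_box([x_j - rho * xi_j for x_j, xi_j in zip(x, xi)], a, b)
    return x


def noisy_gradient(x, rng):
    """Gradient of f(x, w) = 0.5 * ||x - w||^2 at one observation w ~ N(mean, I)."""
    mean = [3.0, 0.25]
    return [x_j - rng.gauss(m_j, 1.0) for x_j, m_j in zip(x, mean)]


# The minimizer of E f over the box [0, 1]^2 is the projection of the mean: (1, 0.25).
x_final = projected_sqg([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], noisy_gradient, 20000)
print(x_final)
```

The step sizes satisfy $\sum \rho_s = \infty$ and $\sum \rho_s^2 < \infty$; with $X = R^n$ the same loop without the projection is the classical stochastic approximation scheme.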

It was shown that under natural assumptions, that are also those of interest
in practice, the sequence $\{x^s\}$ defined by (6.11) converges to a set of minimum
points of the original problem with probability 1. The proof of this fact is based
on the notion of a stochastic quasi-Feyer sequence [8]. A sequence $\{x^s\}_{s=0}^{\infty}$ is a
Feyer sequence for a set $Z \subset R^n$, if [66]
$$\|z - x^{s+1}\| \le \|z - x^s\|, \quad \forall z \in Z.$$
A sequence of random vectors $\{x^s\}_{s=0}^{\infty}$ defined on a probability space $(\Theta, \mathcal{F}, P)$
is a stochastic quasi-Feyer sequence [8] for a set $Z \subset R^n$, if $E\|x^0\|^2 < \infty$, and
for any $z \in Z$
$$E\{\|z - x^{s+1}\|^2 \mid x^0, \dots, x^s\} \le \|z - x^s\|^2 + \gamma_s, \quad s = 0, 1, \dots \tag{6.14}$$
$$\gamma_s \ge 0, \quad \sum_{s=0}^{\infty} E\gamma_s < \infty.$$


Theorem 6.1. [5, p.98]. If $\{x^s\}$ is a stochastic quasi-Feyer sequence for a set
$Z$, then:

(a) the sequence $\|z - x^s\|^2$, $s = 0, 1, \dots$, converges with probability 1 for any
$z \in Z$, and $E\|z - x^s\|^2 < C < \infty$,

(b) the set of accumulation points of $\{x^s(\theta)\}$ is not empty for almost all $\theta$,

(c) if $x'(\theta)$, $x''(\theta)$ are two distinct accumulation points of the sequence $\{x^s(\theta)\}$
which do not belong to the set $Z$, then $Z$ lies in the hyperplane equidistant
from the points $x'(\theta)$, $x''(\theta)$.


In the simplest case when $\gamma_s$ is independent of $(x^0, \dots, x^s)$ the fact (a) would
follow from convergence of the supermartingale
$$v_s = \|z - x^s\|^2 + \sum_{k=s}^{\infty} \gamma_k, \quad v_s \ge 0,$$
$$E\{v_{s+1} \mid v_s\} \le v_s.$$
The assertion (c) follows from the equality
$$\|z - x'\|^2 - \|z - x''\|^2 = 2\langle z, x'' - x' \rangle + \|x'\|^2 - \|x''\|^2 = 0.$$


Consider now a simpler version of the convergence theorem for the iterative 
procedure (6.11) to illustrate the techniques of proof. 


Theorem 6.2. Assume that

(a) $F^0(x)$ is a convex continuous function,

(b) $X$ is a convex compact set,

(c) the parameters $\rho_s$, $\gamma_0(s)$ satisfy with probability 1 the conditions
$$\rho_s \ge 0, \quad \sum_{s=0}^{\infty} \rho_s = \infty, \quad \sum_{s=0}^{\infty} E\{\rho_s |\gamma_0(s)| + \rho_s^2 \|\xi^0(s)\|^2\} < \infty. \tag{6.15}$$

Then $\lim x^s \in X^*$ with probability 1.


Consider a function $F^0(x) = E f^0(x, \omega)$ with second derivatives uniformly bounded in $X$.
Then for
$$\xi^0(s) = \frac{3}{r_s} \sum_{k=1}^{r_s} \frac{f^0(x^s + \Delta_s h^{sk}, \omega^{sk}) - f^0(x^s, \omega^{s0})}{\Delta_s}\, h^{sk}$$
we have
$$E\{\xi^0(s) \mid x^s\} = F^0_x(x^s) + O(\Delta_s),$$
where $\{\omega^{s0}, \dots, \omega^{s r_s}\}$ is a collection of $\omega$-observations independent of $(x^0, \dots, x^s)$
and $\{h^{s1}, \dots, h^{s r_s}\}$ is a collection of observations of the vector $h = (h_1, \dots, h_n)$
whose components are independently and uniformly distributed over $[-1, 1]$. In
this case condition (6.15) signifies that the numbers $\rho_s$, $\Delta_s$, which may depend on
$(x^0, \dots, x^s)$, must be subjected to the conditions (taking into account (6.13)
and the boundedness of $X$):
$$\rho_s \ge 0, \quad \sum_{s=0}^{\infty} \rho_s = \infty, \quad \sum_{s=0}^{\infty} E(\rho_s \Delta_s + \rho_s^2 / \Delta_s^2) < \infty;$$


=1/2,A4,= e 1/4) for any O0< <4} are such sequences. 
Let us notice that if we take 


hek 


f9(2* + Aho, w) — f° (2%, w) 
n=2y A. 


and $f^0(x, \omega)$ satisfies a Lipschitz condition with respect to $x$ uniformly over $\omega$, then
$$E\|\xi^0(s)\|^2 \le \text{const} < \infty$$
when the random parameters have finite distributions and $x^s \in X$. In this case
condition (6.15) leads to the following requirement on $\rho_s$, $\Delta_s$:
$$\rho_s \ge 0, \quad \sum_{s=0}^{\infty} \rho_s = \infty, \quad \sum_{s=0}^{\infty} E(\rho_s \Delta_s + \rho_s^2) < \infty.$$
In what follows we often make the assumption that $E\|\xi^0(s)\|^2$ is bounded, to
simplify the restrictions on $\rho_s$, $\Delta_s$. Such an assumption is not too stringent
for most applications. In practice it is the consequence of (b) and the fact
that estimates of subgradients are often unbiased and distributions of random
parameters are finite.
Proof of Theorem 6.2: The properties of the projection $\pi_X$ yield for any
$x^* \in X^*$
$$\begin{aligned} E\{\|x^* - x^{s+1}\|^2 \mid x^0, \dots, x^s\} \le{} & \|x^* - x^s\|^2 \\ & + 2\rho_s \langle E\{\xi^0(s) \mid x^0, \dots, x^s\}, x^* - x^s \rangle \\ & + \rho_s^2 E\{\|\xi^0(s)\|^2 \mid x^0, \dots, x^s\}. \end{aligned}$$
By the assumption (c) and (6.12) (taking into account that $F^0(x^*) - F^0(x^s) \le 0$)
$$E\{\|x^* - x^{s+1}\|^2 \mid x^0, \dots, x^s\} \le \|x^* - x^s\|^2 + C(\rho_s |\gamma_0(s)| + \rho_s^2 \|\xi^0(s)\|^2),$$
where $C$ is a constant.

In view of (6.15) and by the definition (6.14), it means that $\{x^s\}$ is indeed
a stochastic quasi-Feyer sequence for the set $X^*$. Consequently, the sequence
$\|x^* - x^s\|^2$, $s = 0, 1, \dots$ converges with probability 1 for any $x^* \in X^*$, and the set
of accumulation points of $\{x^s\}$ is not empty. If we show that one of the accumulation
points of $\{x^s(\theta)\}$ belongs to $X^*$ for almost all $\theta$, then from assertion
(c) of Theorem 6.1 would follow the convergence of $\{x^s\}$ with probability 1 to
a point of $X^*$.
Consider the inequality
$$E\|x^* - x^{s+1}\|^2 \le E\|x^* - x^0\|^2 + 2E \sum_{k=0}^{s} \rho_k \langle E\{\xi^0(k) \mid x^0, \dots, x^k\}, x^* - x^k \rangle + \sum_{k=0}^{s} E \rho_k^2 \|\xi^0(k)\|^2.$$
Due to the inequality (6.12)
$$E\|x^* - x^{s+1}\|^2 \le E\|x^* - x^0\|^2 + 2E \sum_{k=0}^{s} \rho_k (F^0(x^*) - F^0(x^k)) + C \sum_{k=0}^{s} E\{\rho_k |\gamma_0(k)| + \rho_k^2 \|\xi^0(k)\|^2\},$$
from which we get
$$E \sum_{k=0}^{\infty} \rho_k (F^0(x^k) - F^0(x^*)) < \infty.$$
Since
$$\sum_{k=0}^{\infty} \rho_k = \infty \quad \text{and} \quad F^0(x^*) - F^0(x^k) \le 0,$$
there exists a subsequence $x^{s_k}$ such that $F^0(x^*) - F^0(x^{s_k}) \to 0$, and this completes
the proof.

The methods which we shall consider below converge under conditions
approximately analogous to those mentioned above. Theorem 6.2 establishes
the convergence of the iterative procedure (6.11) with probability 1. Such
convergence is important in many applications. If $\gamma_0(s) = 0$ and if instead of
(6.15) only the conditions
$$\rho_s \downarrow 0, \quad \sum_{s=0}^{\infty} \rho_s = \infty$$
hold, then it can be shown [5] that
$$\inf_{x^*} E\|x^* - x^s\|^2 \to 0.$$
x* 


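In [63] the following idea was proposed for estimating the efficiency of the vector
$$\bar{x}^s = \Big( \sum_{k=0}^{s} \rho_k \Big)^{-1} \sum_{k=0}^{s} \rho_k x^k.$$
From the inequality
$$E\|x^* - x^{s+1}\|^2 \le E\|x^* - x^0\|^2 + 2E \sum_{k=0}^{s} \rho_k (F^0(x^*) - F^0(x^k)) + C \sum_{k=0}^{s} E\{\rho_k |\gamma_0(k)| + \rho_k^2 \|\xi^0(k)\|^2\}$$
we have that
$$2E \sum_{k=0}^{s} \rho_k (F^0(x^k) - F^0(x^*)) \le E\|x^* - x^0\|^2 + C \sum_{k=0}^{s} E\{\rho_k |\gamma_0(k)| + \rho_k^2 \|\xi^0(k)\|^2\}.$$
If the $\rho_k$ are independent of $(x^0, \dots, x^k)$, then
$$\Big( \sum_{k=0}^{s} \rho_k \Big)^{-1} E \sum_{k=0}^{s} \rho_k (F^0(x^k) - F^0(x^*)) \ge E F^0(\bar{x}^s) - F^0(x^*)$$
and we have the estimate
$$E F^0(\bar{x}^s) - F^0(x^*) \le \Big( 2 \sum_{k=0}^{s} \rho_k \Big)^{-1} \Big[ E\|x^* - x^0\|^2 + C \sum_{k=0}^{s} E\{\rho_k |\gamma_0(k)| + \rho_k^2 \|\xi^0(k)\|^2\} \Big].$$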


6.2.2 The Lagrange Multiplier Method 
The method is characterized by the relations
$$x^{s+1} = \pi_X\Big( x^s - \rho_s \Big[ \xi^0(s) + \sum_{i=1}^{m} u_i(s)\, \xi^i(s) \Big] \Big),$$
$$u_i(s+1) = \max\{0, u_i(s) + \delta_s \eta_i(s)\}, \quad i = 1:m,$$
and when $X = R^n$, $\delta_s = \rho_s = \text{const}$, $\xi^\nu(s) = F^\nu_x(x^s)$, $\eta_i(s) = F^i(x^s)$, $i = 1:m$,
and the $F^\nu(x)$, $\nu = 0:m$, are smooth, it is a deterministic algorithm proposed
in [52]. The stochastic version of this method was studied in [1], [5], where
it was proved that $\min_{k \le s} F^0(x^k)$ converges to $\min F^0(x)$ with probability
1, provided that $F^0(x)$ is strictly convex and $\delta_s = \rho_s$. The convergence for
convex functions $F^0(x)$ (not necessarily strictly convex) was studied in [21]
under the assumption that $\rho_s / \delta_s \to 0$.


6.2.3 Penalty Function Methods. Averaging Operation


Constraints of type (6.2) of the general problem (6.1)–(6.3) can be taken into
account by means of penalty functions, and instead of the original problem we
can minimize a penalized function, for instance
$$\Psi(x, c) = F^0(x) + c \sum_{i=1}^{m} \min\{0, F^i(x)\}$$
on the set $X$, where $c$ is a big enough number. A generalized gradient of $\Psi(x, c)$
at $x = x^s$ is
$$F^0_x(x^s) + c \sum_{i=1}^{m} \min\{0, F^i(x^s)\}\, F^i_x(x^s).$$
If the exact values of $F^i(x^s)$, $F^0_x(x^s)$, $F^i_x(x^s)$ are known, then a deterministic
generalized gradient procedure can be used for minimizing $\Psi(x, c)$. The penalty
function methods for a problem with known values of the constraint functions
$F^i(x^s)$ were considered in [46], [63]. In such cases the projection method (6.11) is
applicable to minimizing $\Psi(x, c)$, since the estimate of the subgradient $\Psi_x(x^s, c)$
is the vector
$$\xi^0(s) + c \sum_{i=1}^{m} \min\{0, F^i(x^s)\}\, \xi^i(s).$$

In general, if instead of the values $F^\nu(x^s)$, $F^\nu_x(x^s)$, $\nu = 0:m$, only statistical
estimates $\eta_\nu(s)$, $\xi^\nu(s)$ are available, it is impossible to actually find
$\min\{0, F^i(x^s)\}$. How to handle this situation was studied in [4], [5].

Consider the following variant of the iterative scheme studied in the previous
section.
$$x^{s+1} = \pi_X\Big( x^s - \rho_s \Big[ \xi^0(s) + c \sum_{i=1}^{m} \min\{0, \bar{F}_i(s)\}\, \xi^i(s) \Big] \Big), \tag{6.16}$$
$$\bar{F}_i(s+1) = \psi_s \eta_i(s+1) + (1 - \psi_s) \bar{F}_i(s), \quad i = 1:m, \tag{6.17}$$
where $\psi_s$ is the step-size, $0 \le \psi_s \le 1$, $\bar{F}_i(0) = \eta_i(0)$,
$$E\{\eta_i(s) \mid x^0, \dots, x^s\} = F^i(x^s) + a_i(s),$$
$$F^\nu(x^*) - F^\nu(x^s) \ge \langle E\{\xi^\nu(s) \mid x^0, \dots, x^s\}, x^* - x^s \rangle + \gamma_\nu(s).$$
For convergence with probability 1 of these kinds of procedures, in addition
to (6.15) we must demand that with probability 1
$$\psi_s \ge 0, \quad \rho_s / \psi_s \to 0, \quad \sum_{s=0}^{\infty} E\psi_s^2 < \infty,$$
$$\sum_{s=0}^{\infty} E\{\rho_s |\gamma_i(s)| + \psi_s |a_i(s)|\} < \infty, \quad i = 1:m.$$


It is worthwhile to note that the above mentioned method may not converge
when $\psi_s \equiv 1$, i.e., for $\bar{F}_i(s) = \eta_i(s)$. If $\psi_s = 1/(s+1)$ then
$$\bar{F}_i(s) = \frac{1}{s} \sum_{k=1}^{s} \eta_i(k), \quad s \ge 1.$$
This is why (6.17) was called the averaging operation. In the case when
$F^i(x) = E f^i(x, \omega)$,
$$\bar{F}_i(s+1) = \psi_s f^i(x^{s+1}, \omega^{s+1}) + (1 - \psi_s) \bar{F}_i(s), \quad i = 1:m. \tag{6.18}$$
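
The averaging operation can be sketched in a few lines. The recursion below follows the form $\bar{F}(s+1) = \psi_s \eta(s+1) + (1 - \psi_s)\bar{F}(s)$ with $\bar{F}(0) = \eta(0)$; the function name, the scalar setting and the sample data are illustrative assumptions.

```python
def averaged_estimates(observations, step):
    """Run F_bar(s+1) = psi_s * eta(s+1) + (1 - psi_s) * F_bar(s), F_bar(0) = eta(0).

    `observations` is the list eta(0), eta(1), ...; `step(s)` returns psi_s.
    Returns the list F_bar(0), F_bar(1), ...
    """
    f_bar = observations[0]
    history = [f_bar]
    for s, eta_next in enumerate(observations[1:]):
        psi = step(s)
        f_bar = psi * eta_next + (1.0 - psi) * f_bar
        history.append(f_bar)
    return history


observations = [10.0, 1.0, 2.0, 3.0, 4.0]

# psi_s = 1/(s+1): F_bar(s) is the arithmetic mean of eta(1), ..., eta(s).
running_mean = averaged_estimates(observations, lambda s: 1.0 / (s + 1))

# psi_s = 1: no averaging at all, F_bar(s) = eta(s).
no_averaging = averaged_estimates(observations, lambda s: 1.0)

print(running_mean)
print(no_averaging)
```

With $\psi_s = 1/(s+1)$ the first step has $\psi_0 = 1$, so the initial value $\eta(0)$ is forgotten and the later values are plain arithmetic means of the subsequent observations.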

The averaging procedure proved to be very useful in stochastic and nondifferentiable
optimization; the following general fact is decisive concerning this
operation. Consider the auxiliary procedure (6.17) itself for a given sequence
$\{x^s\}_{s=0}^{\infty}$. The procedure (6.17) has the following general form
$$\beta(s+1) = \beta(s) - \psi_s [\beta(s) - \eta(s+1)], \quad s = 0, 1, \dots \tag{6.19}$$
where $\psi_s$ is a $B_s$-measurable function and $\eta(s)$ is a random observation of a vector
$V(s)$:
$$E\{\eta(s) \mid B_s\} = V(s) + a(s), \tag{6.20}$$
which in the case of method (6.16), (6.17) takes on the form $V(s) = F(x^s) =
(F^1(x^s), \dots, F^m(x^s))$. Under rather general assumptions (see, for instance,
[10], p. 46), provided that with probability 1
$$\|V(s+1) - V(s)\| / \psi_s \to 0, \tag{6.21}$$
$$\psi_s \ge 0, \quad \sum_{s=0}^{\infty} E\psi_s^2 < \infty, \quad \|a(s)\| / \psi_s \to 0, \tag{6.22}$$
it can be shown that with probability 1
$$\|\beta(s) - V(s)\| \to 0 \quad \text{for } s \to \infty. \tag{6.23}$$
Therefore $\beta(s)$ estimates the vector $V(s)$ with increasing precision and we
can "substitute" the unknown $V(s)$ by $\beta(s)$. If $F^i(x)$, $i = 1:m$, are Lipschitz continuous
functions in $X$ and the points $x^{s+1}$, $x^s$ are connected through the equation
(6.16), $|\eta_i(s)| \le \text{const}$, $\|\xi^\nu(s)\| \le \text{const}$, $i = 1:m$, $\nu = 0:m$, then assumption
(6.21) follows from the condition
$$\rho_s / \psi_s \to 0 \quad \text{for } s \to \infty \tag{6.24}$$
with probability 1.
The assertion (6.23) has close connections to the general Theorem 6.5 concerning
the convergence of nonstationary optimization procedures, since the
step direction
$$2[\beta(s) - \eta(s+1)]$$
of (6.19) is the stochastic quasigradient of the time-dependent function
$$\Phi^s(\beta) = E\|\beta - \eta(s+1)\|^2 \tag{6.25}$$
at $\beta = \beta(s)$.

The averaging operation enables us to elaborate many stochastic analogues 
of known deterministic methods. Gupal [8] has studied the following stochastic 
version of the deterministic procedure described in paper [36]: 


o+1 


at! =ay[2* — p,¢*], (6.26) 


0 +. ite rae 

ed at Bese) par, P(e) <0, 
‘'*, if F;, (#) > 0. 

The requirements for convergence of this method are similar to those for the 

method (6.16). 


Consider now some other methods for which the averaging operation ap- 
peared to be crucial. 


6.2.4 Mixed Stochastic Quasigradient Method 


Bajenov and Gupal [25] were first to apply the averaging procedure to step
directions. The method is defined by the relations
$$x^{s+1} = \pi_X[x^s - \rho_s d^s], \tag{6.27}$$
$$d^{s+1} = \delta_s \xi^0(s+1) + (1 - \delta_s) d^s = d^s + \delta_s [\xi^0(s+1) - d^s], \tag{6.28}$$
$$E\{\xi^0(s) \mid x^0, d^0, \dots, x^s, d^s\} = F^0_x(x^s) + b^0(s), \tag{6.29}$$
$s = 0, 1, \dots$ with initial $d^0 = \xi^0(0)$. Such types of methods have also been
studied in [10], [70], [71], [73].

The sequence $\{x^s\}$ converges with probability 1 to an optimal solution
provided that in addition to requirements (a), (b), (c) of Theorem 6.2 the scalars
$\rho_s$, $\delta_s$ are chosen so as to satisfy with probability 1
$$\rho_s \ge 0, \quad \delta_s \ge 0, \quad \rho_s / \delta_s \to 0, \quad \|b^0(s)\| \to 0, \tag{6.30}$$
$$\sum_{s=0}^{\infty} \rho_s = \infty, \quad \sum_{s=0}^{\infty} E(\rho_s^2 + \delta_s^2) < \infty.$$

The vector $d^s$ defined by the recurrent formula (6.28) is called the averaged,
aggregated, or mixed stochastic quasigradient.


6.3 Nonconvex Nondifferentiable Functions - Finite Difference
Approximation Schemes


The convergence of SQG methods for nonconvex objective and constraint
functions has been studied by many authors (see [5], [10], [12]). In
[12], Nurminski generalized method (6.11) to the case of nonconvex and nondifferentiable
objective functions satisfying the inequality
$$F^0(z) - F^0(x) \ge \langle F^0_x(x), z - x \rangle + o(\|z - x\|)$$
when $z \to x$ for all $x$ from a compact set. Such functions are called weakly
convex. The class of weakly convex functions includes convex functions as
well as nonconvex differentiable functions. Moreover, the maximum of a collection of
weakly convex functions is also a weakly convex function. Significant results
in elaborating SQG methods for nonconvex and nondifferentiable functions
were obtained in [9], [10], [31]. In these papers the following stochastic versions
of the finite difference approximation schemes were proposed.

If values of the functions $F^\nu(x)$, $\nu = 0:m$, can be easily calculated and
$F^\nu(x)$ are differentiable functions, then there exist methods using finite difference
approximations of the gradients $F^\nu_x(x^s)$ at the current point $x^s$:


n YG I} _ Fv ( 8) 
Fi(2*) ~ > Filer thee) Pee, (6.31) 
j=l : 
n v8 1) PV (mo Jy. 
FY (a) ~ 27 (x to (a Ave’) 5 (6.32) 
j=! 


where $e^j$ is the unit vector on the j-th axis and $\Delta_s > 0$. Although the finite
difference approximations exist for nondifferentiable functions, the use of
them does not guarantee the convergence of optimization procedures. The proposed
modification of finite-difference approximation schemes consists of a slight
randomization of them:


FY (2*) hey £”(8) = > meee Ey, (6.33) 


a 


“. FY (z) + Aye) — FY’ (z! — Age’) , 


FY (2°) ~ €" (8) = x e’, (6.34) 


j=l 
where Fy (z)is a subgradient; F = (27 + hf,...,2% +h¢,...,08 +h9), BY = 
(af +hf,...,0; —V+h$_ 1, 25,084, thf i 1,---50, +ha)sg = Ln and hé are 
independent random quantities uniformly distributed on the interval $[-\Delta_s/2, \Delta_s/2]$.
The convergence of the corresponding optimization procedures is based on the
fact that with probability 1
bid Vv (x67 } Vv (=6z } 
min ||E > ale ck a i eee — F¥(z*)|| +0 
PY (#) = A, 

(6.35) 

when $\Delta_s \to 0$ and $F^\nu(x)$ are local Lipschitz functions. Therefore the vectors $\xi^\nu(s)$
defined by (6.33), (6.34) are also statistical estimates of the subgradient
$F^\nu_x(x^s)$, satisfying the general requirements (6.4), (6.5), (6.6).
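
The idea of randomizing a finite-difference scheme can be illustrated with a short sketch. It assumes one particular reading of the forward-difference variant: the base point is shifted by a vector $h$ with independent components uniform on $[-\Delta/2, \Delta/2]$, and forward differences of length $\Delta$ are then taken from the shifted point. The function name and the test functions are hypothetical.

```python
import random


def randomized_forward_difference(F, x, delta, rng):
    """Forward-difference gradient estimate taken at a randomly shifted point.

    The shift h has independent components uniform on [-delta/2, delta/2]
    (an assumed reading of the randomized scheme).
    """
    x_tilde = [x_j + rng.uniform(-delta / 2, delta / 2) for x_j in x]
    base = F(x_tilde)
    estimate = []
    for j in range(len(x)):
        shifted = list(x_tilde)
        shifted[j] += delta
        estimate.append((F(shifted) - base) / delta)
    return estimate


def linear(x):
    """Smooth test function with gradient (2, -3)."""
    return 2.0 * x[0] - 3.0 * x[1] + 1.0


def abs_sum(x):
    """Nondifferentiable test function |x_1| + |x_2|."""
    return abs(x[0]) + abs(x[1])


rng = random.Random(0)
print(randomized_forward_difference(linear, [0.4, -0.2], 0.1, rng))
print(randomized_forward_difference(abs_sum, [0.0, 0.0], 0.1, rng))
```

For the linear function the estimate reproduces the gradient exactly up to rounding. At the kink of the second function the estimate varies with the random shift, and each component stays between $-1$ and $1$, the range of the subgradients there.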

For stochastic programming problems when FY (z) = Ef’ (z,w), we have 
analogues of the (6.33), (6.34) 


€’(6) = - e+ Ate) aa (th) (6.36) 


j=l 


". fY(E9 + Ave w%) — fY(F9 — Age yw? nty) ; 

e@=yf ( + af; ) o ah Wy, +3) (6.37) 
jal 

which also satisfy the relation (6.35). 

Different generalizations of SQG methods to the case of local Lipschitz
functions $F^\nu(x)$ making use of the (6.33), (6.34), (6.36), (6.37) type approximations
have been studied in papers [10], [73], [74].

Let us discuss the general idea of such procedures in more detail.


6.4 Simultaneous Optimization and Approximation Procedures, 
Nonstationary Optimization 

Suppose we have to minimize a function $f^0(x)$ of a rather complex nature, for
example, one that does not have continuous derivatives. Consider a sequence of
"good" functions $\{F^0(x, s)\}$, for instance smooth ones, converging to $f^0(x)$ for
$s \to \infty$. Now consider the procedure
$$x^{s+1} = x^s - \rho_s F^0_x(x^s, s), \quad s = 0, 1, \dots \tag{6.38}$$
Under rather general conditions ($\rho_s \downarrow 0$, $\sum \rho_s = \infty$) it is possible to show (see
[5], [14], [17] and Theorem 6.3) that $F^0(x^s, s) \to \min f^0(x)$.
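
A one-dimensional sketch of procedure (6.38) follows. The choices here are assumptions made only for illustration: the nonsmooth function is $f^0(x) = |x|$, the smooth approximations are $F^0(x, s) = \sqrt{x^2 + \varepsilon_s^2}$ with $\varepsilon_s = 1/(s+1)$, and the step size is $\rho_s = 1/(s+1)$.

```python
import math


def smoothed_abs_gradient(x, eps):
    """Derivative of the smooth approximation sqrt(x^2 + eps^2) of |x|."""
    return x / math.sqrt(x * x + eps * eps)


def simultaneous_optimization(x0, steps):
    """Iterate x^{s+1} = x^s - rho_s * F_x(x^s, s) while the approximation sharpens."""
    x = x0
    for s in range(steps):
        rho = 1.0 / (s + 1)   # rho_s decreases to 0 and sums to infinity
        eps = 1.0 / (s + 1)   # smoothing parameter: F(x, s) -> |x| uniformly
        x -= rho * smoothed_abs_gradient(x, eps)
    return x


x_final = simultaneous_optimization(2.0, 200)
print(x_final, abs(x_final))
```

The approximation error $|F^0(x, s) - |x||$ is at most $\varepsilon_s$, so optimization and approximation proceed simultaneously: the gradient steps are taken on functions that are smooth at every iteration but approach the nonsmooth target.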

Often approximate functions may have the form of mathematical expectations
$$F^0(x, s) = \int f^0(x + h)\, P_s(dh) = E f^0(x + h(s)), \tag{6.39}$$
where the measure $P_s(dh)$ for $s \to \infty$ is centered at the point 0. Hence instead of
the procedure given by (6.38) that requires the exact value of the gradient of the
mathematical expectation, we can use the ideas of the stochastic quasigradient
methods.
For example, see [9], let $h(s)$ be random vectors with independent components
uniformly distributed on $[-\Delta_s/2, \Delta_s/2]$, $\Delta_s \to 0$ for $s \to \infty$, and
suppose that $F^0(x, s)$ is continuously differentiable and $F^0(x, s) \to f^0(x)$ uniformly
on any bounded domain. Consider the stochastic procedure with the
(6.34) type approximation


g?t} = 2 — peé’(s),8 =0,1... 


€(8) = > LP (E* + Age?) = f0(F = Ase’) (6.40) 
j=l 


It can be shown that
$$E\{\xi^0(s) \mid x^s\} = F^0_x(x^s, s),$$
where $F^0(x, s)$ is defined by (6.39).

In other words the method (6.40) is a stochastic analogue of the method
(6.38). Procedures (6.38), (6.40) are examples of simultaneous optimization and
estimation procedures. The development of such procedures is connected with
the following general problem of nonstationary optimization [15]-[20], [53],
[75].

The objective function $F^0(x, s)$ and the feasible set $X_s$ of the nonstationary
problem depend on the iteration number $s = 0, 1, \dots$. It is necessary to create
a sequence of approximate solutions $\{x^s\}$ that tends, in some sense, to follow
the time path of the optimal solutions: for $s \to \infty$,
$$\lim [F^0(x^s, s) - \min\{F^0(x, s) \mid x \in X_s\}] = 0.$$
The case when there exist $\lim F^0(x, s)$ and $\lim X_s$ (in some sense) for $s \to \infty$
was called the limit extremal problem [14], [17], [5]. Optimization
problems with time-varying functions and a known trend of the optimal solutions
are considered in [68], [54], [60].

To illustrate the ideas involved in the proof of convergence results, let us 
consider the following simple case: 


Theorem 6.3. Assume that:

(a) $F^0(x, s)$, $f^0(x)$ are convex continuous functions,
(b) $X$ is a convex compact set,

(c) $F^0(x, s) \to f^0(x)$ uniformly in $X$,

(d) $\|F^0_x(x^s, s)\| \le \text{const}$,
$$x^{s+1} = \pi_X[x^s - \rho_s F^0_x(x^s, s)] \tag{6.41}$$
and the parameters $\rho_s$ satisfy the conditions
$$\rho_s \to 0, \quad \sum_{s=0}^{\infty} \rho_s = \infty.$$
Then $F^0(x^s, s) \to f^0(x^*) = \min\{f^0(x) \mid x \in X\}$.

The principal difficulties associated with the convergence of procedure
(6.41) are connected with the choice of the step-size $\rho_s$. There is no guarantee
that the new approximate solution $x^{s+1}$ will belong to the domain of the
smaller values of functions $F^0(x, t)$ for $t \ge s+1$ (see Figure 6.1). Therefore even
for $X = R^n$ and continuously differentiable functions $F^0(x, s)$, (6.41) is an
essentially nonmonotonic optimization procedure. There is one more difficulty.
In the general case without the assumption (c), the aim of $\{x^s\}$ is to track the
set of optimal solutions
$$X_s^* = \{x \in X_s \mid F^0(x, s) = \min_{y \in X_s} F^0(y, s)\}.$$
Unfortunately the Hausdorff distance between $X_s^*$ and $X_{s+1}^*$ may be large even
for a small distance between $F^0(x, s)$ and $F^0(x, s+1)$, as shown in Figure
6.2.








Figure 6.1. Figure 6.2. (Sketches of the functions $F^0(x, s)$ and $F^0(x, s+1)$ and of the optimal sets $X_s^*$, $X_{s+1}^*$.)


The convergence study of the (6.38), (6.40), (6.41) type procedures in the general
case involves the sets of $\varepsilon$-solutions (see [18], [75], [76]).

The essentially nonmonotonic solution procedures need an appropriate
technique to prove their convergence. Often the necessary analysis can be based
on the following result [5], [11].


Theorem 6.4. [5, p. 181] Suppose that $X^* \subset R^n$ is closed and $\{x^s\}$ is a
sequence of vectors in $R^n$ such that
(1) for all $s$, $x^s \in K$, with $K$ compact,
(2) for any subsequence $\{x^{s_k}\}$ with $\lim x^{s_k} = x'$:
(a) if $x' \in X^*$, then $\|x^{s_k+1} - x^{s_k}\| \to 0$ as $k \to \infty$,
(b) if $x' \notin X^*$, then for $\varepsilon$ sufficiently small and for any $s_k$
$$\tau_k = \min\{s \mid s > s_k,\ \|x^s - x^{s_k}\| > \varepsilon\} < \infty,$$
(3) there exists a continuous function $V(x)$ attaining on $X^*$ an at most countable
set of values and
$$\lim_{k \to \infty} V(x^{\tau_k}) < \lim_{k \to \infty} V(x^{s_k}).$$
Then the sequence $\{V(x^s)\}$ converges and all accumulation points of the
sequence $\{x^s\}$ belong to $X^*$.


The conditions of this theorem are similar to the necessary and sufficient
convergence conditions proposed by Zangwill (see [65]). However, Zangwill's
conditions are very difficult to verify for a nonmonotonic procedure.

Condition (2) of Theorem 6.4 prevents the whole sequence $\{x^s\}$ from converging to a limit
point $x'$ which does not belong to the set $X^*$. However, condition (2) alone
does not prevent "cycling", i.e., such a behavior of $\{x^s\}$ that it visits
any neighborhood of $x' \notin X^*$ infinitely many times. To exclude such a case
condition (3) is imposed, which guarantees that the sequence $\{x^s\}$ leaves
a neighborhood of $x'$ with decreasing values of some Lyapunov function
$V(x)$. Let us now illustrate the use of this theorem.


Proof of Theorem 6.3: The conditions 1, 2(a) of Theorem 6.4 are fulfilled. It
suffices to verify the conditions 2(b) and 3. Let $x^{s_k} \to x' \notin X^*$; we need to show
that $\tau_k < \infty$. We argue by contradiction and suppose that $\tau_k = \infty$.
For this purpose, we consider the function $V(x) = \min_{x^* \in X^*} \|x^* - x\|^2$. We
have that
$$\begin{aligned} V(x^{s+1}) &= \min_{x^* \in X^*} \|x^* - x^{s+1}\|^2 = \|x^*(s+1) - x^{s+1}\|^2 \le \|x^*(s) - x^{s+1}\|^2 \\ &\le V(x^s) + 2\rho_s \langle F^0_x(x^s, s), x^*(s) - x^s \rangle + \rho_s^2 \|F^0_x(x^s, s)\|^2. \end{aligned}$$
Since $x^{s_k} \to x' \notin X^*$ and $\|x^s - x^{s_k}\| \le \varepsilon$ for sufficiently large $k$ and any $s \ge s_k$,
there exists $\delta > 0$ such that
$$f^0(x^*) - f^0(x^s) < -\delta$$
and for $x^* \in X^*$ we have
$$\begin{aligned} \langle F^0_x(x^s, s), x^* - x^s \rangle &\le F^0(x^*, s) - F^0(x^s, s) \\ &\le F^0(x^*, s) - f^0(x^*) + f^0(x^*) - f^0(x^s) + f^0(x^s) - F^0(x^s, s) \\ &\le -\frac{\delta}{2}. \end{aligned}$$
Therefore
$$V(x^{s+1}) \le V(x^s) - \delta\rho_s + C\rho_s^2 = V(x^s) - \rho_s(\delta - C\rho_s) \le V(x^{s_k}) - \frac{\delta}{2} \sum_{l=s_k}^{s} \rho_l$$
and for a sufficiently large $s$, this contradicts the fact that $|V(x)| \le \text{const}$ when
$x \in X$. So, condition 2 is satisfied. Looking at condition 3, it is easy to realize
that
$$V(x^{\tau_k}) \le V(x^{s_k}) - \frac{\delta}{2} \sum_{s=s_k}^{\tau_k - 1} \rho_s.$$
Hence, in view of the properties of $\tau_k$,
$$\varepsilon < \|x^{\tau_k} - x^{s_k}\| \le \sum_{s=s_k}^{\tau_k - 1} \|x^{s+1} - x^s\| \le C \sum_{s=s_k}^{\tau_k - 1} \rho_s,$$
where $C$ is a constant. Then
$$V(x^{\tau_k}) \le V(x^{s_k}) - \frac{\varepsilon\delta}{2C}$$
or equivalently
$$\lim V(x^{\tau_k}) < \lim V(x^{s_k})$$
and this completes the proof.
Consider now a more general procedure

$$x^{s+1} = \pi_X[x^s - \rho_s \xi^0(s)], \quad s = 0, 1, \ldots, \tag{6.42}$$

$$E\{\xi^0(s) \mid x^0, \ldots, x^s\} = F_x(x^s, s) + b^0(s).$$


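Theorem 6.5. [19] Assume that

(a) $F^0(x, s)$ are convex continuous functions,

(b) $X$ is a convex compact set,

(c) $\max_{x \in X} |F^0(x, s+1) - F^0(x, s)| \le \delta_s$, $E\|\xi^0(s)\|^2 \le \text{const}$,

(d) with probability 1

$$\delta_s/\rho_s \to 0, \quad \|b^0(s)\| \to 0, \quad \rho_s \ge 0, \quad \sum_{s=0}^{\infty} \rho_s = \infty, \quad \sum_{s=0}^{\infty} E\rho_s^2 < \infty.$$

Then with probability 1

$$|F^0(x^s, s) - \min\{F^0(x, s) \mid x \in X\}| \to 0 \quad \text{for } s \to \infty.$$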
6.5 Feasible Directions Methods

Consider the minimization of a continuously differentiable function $F^0(x)$ in a compact convex set $X$. If $F^0(x^s)$ and $F_x^0(x^s)$ are known, then the standard linearization method is defined by the relations

$$x^{s+1} = x^s + \rho_s(\bar{x}^s - x^s),$$
$$\langle F_x^0(x^s), \bar{x}^s \rangle = \min\{\langle F_x^0(x^s), x \rangle \mid x \in X\},$$
$$F^0(x^{s+1}) = \min_{0 \le \rho \le 1} F^0(x^s + \rho(\bar{x}^s - x^s)).$$


The stochastic variant of this method has been studied in [5], [6], [10], [80] and is defined by the relations

$$x^{s+1} = x^s + \rho_s(\bar{x}^s - x^s), \tag{6.43}$$
$$\langle d^s, \bar{x}^s \rangle = \min\{\langle d^s, x \rangle \mid x \in X\},$$
$$d^{s+1} = \delta_s \xi^0(s+1) + (1 - \delta_s) d^s = d^s + \delta_s[\xi^0(s+1) - d^s],$$

where $\rho_s$, $\delta_s$ satisfy conditions similar to those of Section 6.2. Notice that if the vectors $\xi^0(s)$ are used instead of $d^s$ ($\delta_s = 1$), then simple examples show that the method may not converge.
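
As a rough illustration of the relations (6.43), the following Python sketch applies the stochastic linearization method to a toy problem: minimizing $\frac{1}{2}\|x - t\|^2$ over a box, with gradient observations corrupted by Gaussian noise. The objective, the noise model, the function name `stochastic_linearization` and the particular rules $\rho_s = 1/(s+1)$, $\delta_s = (s+1)^{-0.6}$ are assumptions made only for this illustration; they are one choice with $\rho_s/\delta_s \to 0$ and are not prescribed by the text.

```python
import random


def stochastic_linearization(target, lo, hi, n_iter=40000, noise=0.5, seed=0):
    """Sketch of (6.43) for F(x) = 0.5*||x - target||^2 on the box [lo, hi]^n.

    The toy objective, the Gaussian noise and the step rules are assumptions.
    """
    rng = random.Random(seed)
    n = len(target)

    def xi(x):
        # Noisy observation of the gradient x - target.
        return [x[j] - target[j] + rng.gauss(0.0, noise) for j in range(n)]

    x = [hi] * n          # start from a vertex of the box
    d = xi(x)             # d^0 = xi^0(0)
    for s in range(n_iter):
        rho = 1.0 / (s + 1)             # step size rho_s
        delta = 1.0 / (s + 1) ** 0.6    # averaging weight delta_s
        # Linear subproblem over the box: a vertex minimizing <d, x>.
        x_bar = [lo if d[j] > 0 else hi for j in range(n)]
        x = [x[j] + rho * (x_bar[j] - x[j]) for j in range(n)]
        g = xi(x)
        d = [d[j] + delta * (g[j] - d[j]) for j in range(n)]
    return x
```

For instance, `stochastic_linearization([0.3, 0.7], 0.0, 1.0)` should return a point close to the interior minimizer, although every iterate is a convex combination of vertices of the box. The averaged direction $d^s$ is what makes this possible; the raw observations $\xi^0(s)$ are too noisy to select the right vertex reliably.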

The linearization method is usually applied when $X$ is defined by linear constraints. In that case the method requires the solution of a linear subproblem at each iteration, in contrast to the projection method (6.11), which requires the solution of a quadratic subproblem. Notice that the objective function of the subproblem is only slightly perturbed from one iteration to the next; therefore, for $s > 0$ only small adjustments of the preceding solution are needed in order to obtain a solution of the current subproblem.

Consider now the case when

$$X = \{x \mid F^i(x) \le 0,\ i = 1:m\}.$$

Assume that $F^\nu(x)$, $\nu = 0:m$, are continuously differentiable functions, the set $X$ is compact, and the gradient $F_x^0(\cdot)$ is Lipschitz continuous on $X$. Let the sequences $\{x^s\}$ and $\{v^s\}$ be defined by the relations [10], [78]-[80]

$$x^{s+1} = x^s + \rho_s v^s, \tag{6.44}$$
$$d^{s+1} = d^s + \delta_s(\xi^0(s+1) - d^s), \quad d^0 = \xi^0(0),$$
$$E\{\xi^0(s) \mid B_s\} = F_x^0(x^s) + b^0(s),$$

where $B_s$ is the $\sigma$-field generated by the points $\{(x^0, v^0, d^0), \ldots, (x^s, v^s, d^s)\}$ and $v^s$ is a solution of the subproblem:

$$\max\{\eta \mid \langle d^s, v \rangle + \eta \le 0,\ \langle F_x^i(x^s), v \rangle + \eta \le 0,\ i \in I^s,\ -1 \le v_j \le 1,\ j = 1:n\}, \tag{6.45}$$
$$I^s = \{i : -\varepsilon_s \le F^i(x^s) \le 0\}, \quad \varepsilon_s \downarrow 0.$$

Thus it is assumed that the exact values $F^i(x)$, $i = 1:m$, can be calculated. Consider

$$\bar{\rho}_s = \max\{\rho \mid x^s + \rho v^s \in X,\ \rho \ge 0\}$$

and let

$$\rho_s = \min\{\bar{\rho}_s, \sigma_s\}, \quad \sigma_s > 0.$$
Theorem 6.6. (see [10], p.113) If with probability 1 


oO oO 
Pa 2 0,5. Py = OO; 05 /be — 0,5 £8 < 00,6, — 0, 
s=0 s=0 
E\|@ (6)? < C < 00, Eb" (#)|? < C, Ef |]0° (*)|||B.} — 0 


for some constant $C$, then the sequence $\{F^0(x^s)\}$ converges with probability 1 and all cluster points of the sequence $\{x^s\}$ satisfy the necessary optimality conditions of the problem.


Ruszczynski [80] modified the method (6.43) for nonconvex objective functions with the following property: there exist $\delta > 0$ and $\mu > 0$ such that for all $x \in X$ and all $z$ satisfying $\|z - x\| \le \delta$

$$F^0(z) - F^0(x) \ge \langle F_x^0(x), z - x \rangle - \mu\|z - x\|^2,$$

where $X$ is a compact set and $F_x^0(x)$ is a subgradient. This class of functions is identical with the family of functions which, in some open neighborhood of $x$, have a representation [81]:

$$F^0(x) = \max_{u \in U} \varphi(x, u),$$

where $U$ is a compact set and $\varphi(\cdot, u)$ has second derivatives continuous in $(x, u)$. In the method the following direction-finding subproblem is used instead of the subproblem (6.45):
alts) e TP ene ees) a 
t=1:m,yEX}, 


where $F^i(x)$, $i = 1:m$, are assumed to be convex and differentiable functions on $X$, and $X$ is a convex compact set. If $y^s$ is a solution of the subproblem, then

$$v^s = y^s - x^s$$

is used in equation (6.44). The convergence theorem is similar to that of the method (6.44), provided that, in addition to the alterations mentioned above, with probability 1

$$b^0(s) = 0, \quad \delta_s = a\rho_s, \quad 0 \le \rho_s \le \min(1, 1/a),$$
$$\sum_{s=0}^{\infty} \rho_s = \infty, \quad \sum_{s=0}^{\infty} E\rho_s^2 < \infty,$$

where the scalar $\rho_s$ may, as usual, depend on the past history generated by $(x^0, d^0), \ldots, (x^s, d^s)$.

The paper [80] contains a rather general requirement on the choice of the direction $v^s$, which enables different modifications of the subproblem. In papers [10], [80], procedures (6.43), (6.44) were generalized to the minimization of locally Lipschitz functions, making use of approximations (6.33), (6.34), (6.36), (6.37).


6.6 Adaptive SQG Procedure 


The success of the application of SQG methods depends on the particular rules for choosing their parameters: step sizes and step directions. The general convergence theorems provide wide freedom in choosing them adaptively as functions of the (random) history $B_s$, for instance $(x^0, \ldots, x^s)$. What is the best choice?

The behavior of SQG methods is unusual compared with deterministic methods. A sequence of approximate solutions $\{x^s\}$ that converges with probability 1 defines a set of paths (realizations) leading from the initial point $x^0$ to the set of optimal solutions (Figure 6.3).

In the case of a unique solution the procedure may approach a neighborhood of the solution in different ways. The choice $\rho_s = 1/s$ serves all paths in the same way, independently of the current situation, and cannot be the best strategy. Of course, the definition of the best strategy is a consequence of the definition of the performance function. If the performance function is defined on the whole set of paths and deals only with the asymptotic behavior, then the choice $\rho_s = a/s$ with an appropriate constant $a$ depending only on the unique solution might be the best option (see the pioneering papers [84], [85]). Unfortunately, this conclusion about the "optimality" of $\rho_s = a/s$ has misled the use of stochastic approximation type procedures. The asymptotic approach is rather unsatisfactory for practical application, since it does not make any use of the valuable information which accumulates during the solution, in particular, the starting point. The practical aim is usually to reach some neighborhood of the solution rather than to find the precise value of the solution itself. SQG methods are good enough for this purpose. They have been applied to various practical problems (see, for instance, [5], [7]), and in all these applications only adaptive principles were used for choosing their parameters (this is discussed in detail in Chapters 15-17).

Figure 6.3.

The adequate choice of the parameters of a nonmonotonic procedure is not a trivial problem, as is shown even by the simplest deterministic analogue of the method (6.11), the so-called generalized gradient method (see [5], [38])

$$x^{s+1} = \pi_X[x^s - \rho_s F_x^0(x^s)], \quad s = 0, 1, \ldots, \tag{6.47}$$

where $F_x^0(x^s)$ is a subgradient of $F^0$ at $x^s$. Since there is no guarantee that the objective function decreases in the direction $-F_x^0(x^s)$ (see Figures 6.4 and 6.5), for any choice of the $\rho_s$ satisfying the convergence conditions (see Theorem 6.3)

$$\rho_s \to 0, \quad \sum_{s=0}^{\infty} \rho_s = \infty$$

the sequence $\{F^0(x^s)\}$ shows oscillatory behavior with a tendency to decrease "on average". The stochastic version (6.11) is much more difficult, since exact values of the objective function are not available.
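
The nonmonotonic behavior of (6.47) can be observed on a small example. The sketch below is only an illustration under assumed data: the function $F^0(x) = |x_1| + 2|x_2|$, the box $X = [-5, 5]^2$, the rule $\rho_s = 1/(s+1)$ and the name `generalized_gradient` are all hypothetical choices, not taken from the text.

```python
def generalized_gradient(x0, n_iter=2000, box=5.0):
    """Sketch of (6.47) for F(x) = |x_1| + 2|x_2| on the box [-box, box]^2."""

    def F(x):
        return abs(x[0]) + 2.0 * abs(x[1])

    def sign(t):
        return (t > 0) - (t < 0)

    x = list(x0)
    values = [F(x)]
    for s in range(n_iter):
        rho = 1.0 / (s + 1)                      # rho_s -> 0, sum of rho_s diverges
        g = [sign(x[0]), 2.0 * sign(x[1])]       # a subgradient of F at x
        # Subgradient step followed by projection onto the box.
        x = [max(-box, min(box, x[j] - rho * g[j])) for j in range(2)]
        values.append(F(x))
    return x, values
```

Starting from `[3.0, 1.5]`, the recorded values of the objective increase at many iterations, while the best value found so far still tends to the minimum value 0. This is the oscillatory behavior with a decreasing trend described above.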

A rather general way of changing the $\rho_s$ would be to begin with a sufficiently large value for the first few iterations, and to decrease $\rho_s$ if additional tests show that the current point is in the vicinity of the optimum. The averaging procedure (see Sections 6.2.2, 6.2.3) has proved useful in tests of this type:

$$\bar{F}_x^\nu(s+1) = \bar{F}_x^\nu(s) + \delta_s[\xi^\nu(s+1) - \bar{F}_x^\nu(s)],$$
$$\bar{F}^\nu(s+1) = \bar{F}^\nu(s) + \psi_s[\eta_\nu(s+1) - \bar{F}^\nu(s)],$$

since $\min\{\|\bar{F}_x^\nu(s) - z\| \mid z \in \partial F^\nu(x^s)\} \to 0$ and $|\bar{F}^\nu(s) - F^\nu(x^s)| \to 0$ under rather general conditions.
Figure 6.4. Figure 6.5.


Therefore, averaging yields increasingly precise estimates of the gradients (subgradients) and of the values of the functions without increasing the number of observations. To avoid the influence of long tails of the past, it is sometimes more useful to adopt averaging of the type

$$\bar{F}_x^\nu(s, k_s) = \frac{1}{s - k_s + 1} \sum_{\mu = k_s}^{s} \xi^\nu(\mu),$$
$$\bar{F}^\nu(s, k_s) = \frac{1}{s - k_s + 1} \sum_{\mu = k_s}^{s} \eta_\nu(\mu), \quad k_s \le s.$$

The decision as to whether to change the $\rho_s$ or other parameters (steps of finite difference approximations, the smoothing parameters) may then be based on two modes:

• interactive mode

• automatic mode
In the interactive mode it is assumed that the user can monitor the progress of the optimization process and can intervene to change the values of the step size and other parameters. These decisions should be based on the behavior of the averaged values $\bar{F}^\nu(s)$, $\bar{F}_x^\nu(s)$ and their different combinations, and must partially be made by the user on the basis of the visually observed behavior of these quantities, for instance, in the case when the observed behavior of $\bar{F}^0(s)$ shows regular oscillations (see Figure 6.6, interval $[a, b]$).





Figure 6.6. 


In the automatic mode the decisions about changing parameters are made automatically on the basis of tests which formalize the notion of "oscillatory behavior".
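
One classical way to formalize such a test is Kesten's rule from stochastic approximation, in which the step size is reduced only when successive gradient estimates change sign. It is used below purely as a stand-in: it is not claimed to be one of the rules of Chapters 15-17. The toy problem (minimizing $E|x - \theta|$, whose solution is the median of $\theta$), the constant `c` and the function name are assumptions of this sketch.

```python
import random


def sqg_median_kesten(sample_theta, n_iter=50000, c=2.0, x0=-10.0):
    """Minimize E|x - theta| with a step size driven by observed sign changes.

    Kesten-type rule (assumed here): rho = c / (1 + number of sign changes).
    """
    x = x0
    changes = 0          # sign changes of the stochastic subgradient so far
    prev_xi = None
    for _ in range(n_iter):
        theta = sample_theta()
        xi = 1.0 if x >= theta else -1.0      # stochastic subgradient of |x - theta|
        if prev_xi is not None and xi * prev_xi < 0:
            changes += 1                      # oscillation detected: shrink the step
        prev_xi = xi
        rho = c / (1 + changes)
        x = x - rho * xi
    return x, changes
```

Far from the solution the estimates keep the same sign, so the step stays large and the iterates travel quickly; near the solution the signs alternate and the step decreases roughly like $1/s$.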

There is strong evidence that the interactive mode cannot be completely avoided in stochastic optimization. The only question is to what extent the automatic mode should be developed. The situation closely resembles driving a car. Of course, if road conditions are deterministic, it is possible to imagine an automaton which drives the car. But since road conditions are far from a well-formalized situation, the user drives the car himself, using only minimal information about its construction.

Various concrete rules for choosing the parameters of SQG methods adaptively are discussed in Chapters 15-17.
6.7 Optimization of Stochastic Systems: General Standard Problem

In this and the next sections we discuss some applications of SQG methods to stochastic programming problems in which $F^\nu(x) = E f^\nu(x, \omega)$, $\nu = 0:m$. From the discussion in Chapter 1 it follows that taking into account the influence of uncertain random factors in the optimization of systems leads to stochastic programming problems of the following standard form:

$$\text{minimize } F^0(x) = E f^0(x, \omega) \tag{6.48}$$
$$\text{subject to } F^i(x) = E f^i(x, \omega) \le 0, \quad i = 1:m, \tag{6.49}$$
$$x \in X \subset R^n, \tag{6.50}$$

where $E$ is the operation of mathematical expectation with respect to some probability space $(\Omega, \mathcal{A}, P)$.

The problem (6.48)-(6.50) is a model for stochastic systems optimization in which the decision $x$ (the values assigned to the system parameters) is chosen in advance, before the random factors $\omega$ are observed. A stochastic model tends to take into account all possible eventualities in order to stabilize the optimal solution with respect to perturbations of the data. There is also a class of models in which the decision $x$ is chosen only after an experiment over $\omega$ is realized, and $x$ is based on the actual knowledge of the outcomes of this experiment. Such situations occur in real-time control and short-term planning. In practice, these problems are usually reduced to problems of the type (6.48)-(6.50) via decision rules (see Chapter 1).

Consider some particular formulas for computing estimates of the values $F^\nu(x^s)$, $F_x^\nu(x^s)$. Suppose that it is possible to calculate the values of the random functions $f^\nu(x^s, \omega)$. Then we can take

$$\eta_\nu(s) = \frac{1}{N_s} \sum_{k=1}^{N_s} f^\nu(x^s, \omega^{sk}), \quad \nu = 0:m, \tag{6.51}$$

where the number $N_s \ge 1$ may depend on the past random history $B_s$ of the stochastic procedure: the minimal $\sigma$-subfield that includes at least the $\sigma$-algebra generated by the path $\{x^0, \ldots, x^s\}$ and possibly some other random paths associated with such quantities as Lagrange multipliers, averaged subgradients, etc. The collection $\omega^{s1}, \ldots, \omega^{sN_s}$ results from samples of $\omega$ which are mutually independent with respect to $s = 0, 1, \ldots$. By definition we have

$$E\{\eta_\nu(s) \mid B_s\} = \frac{1}{N_s} \sum_{k=1}^{N_s} E\{f^\nu(x^s, \omega^{sk}) \mid x^s\} = F^\nu(x^s).$$


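If the functions $F^\nu(x)$ have uniformly bounded second derivatives, then for the random vectors

$$\xi^\nu(s) = \sum_{j=1}^{n} \frac{f^\nu(x^s + \Delta_s e^j, \omega^{sj}) - f^\nu(x^s, \omega^{s0})}{\Delta_s}\, e^j \tag{6.52}$$

we would have

$$E\{\xi^\nu(s) \mid x^s\} = F_x^\nu(x^s) + b^\nu(s), \quad \|b^\nu(s)\| \le \text{const} \cdot \Delta_s,$$

where $e^j$ is the unit vector on the $j$-th axis, $\Delta_s > 0$, and $\{(\omega^{s0}, \ldots, \omega^{sn})\}_{s=0}^{\infty}$ are the results of independent (over $s = 0, 1, \ldots$) samples of $\omega$ (we could have $\omega^{s0} = \omega^{s1} = \ldots = \omega^{sn}$). For the vector

$$\xi^\nu(s) = \frac{3}{r_s} \sum_{k=1}^{r_s} \frac{f^\nu(x^s + \Delta_s h^{sk}, \omega^{sk}) - f^\nu(x^s, \omega^{s0})}{\Delta_s}\, h^{sk}, \tag{6.53}$$

where $h^{s1}, \ldots, h^{sr_s}$ are observations, independent of $B_s$, of the random vector $h = (h_1, \ldots, h_n)$ whose components are independently and uniformly distributed over $[-1, 1]$, and the number $r_s \ge 1$ depends on $B_s$, we have

$$E\{\xi^\nu(s) \mid B_s\} = \frac{3}{r_s} \sum_{k=1}^{r_s} E\left\{\frac{F^\nu(x^s + \Delta_s h^{sk}) - F^\nu(x^s)}{\Delta_s}\, h^{sk}\right\}$$
$$= \frac{3}{r_s} \sum_{k=1}^{r_s} E\{\langle F_x^\nu(x^s), h^{sk} \rangle h^{sk}\} + \Delta_s a(s) = F_x^\nu(x^s) + \Delta_s a(s),$$

where $\|a(s)\| \le \text{const}$.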
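
The two estimates (6.52) and (6.53) can be sketched in a few lines of Python. The function names, the argument `sample_omega` (a hypothetical sampler of $\omega$ called once per function evaluation) and the calling conventions are assumptions of this sketch. The factor 3 in the random-direction estimate compensates for $E\,h_j^2 = 1/3$ when $h_j$ is uniform on $[-1, 1]$.

```python
import random


def forward_difference_estimate(f, x, delta, sample_omega):
    """Estimate (6.52): one forward difference along each coordinate axis."""
    n = len(x)
    base = f(x, sample_omega())
    xi = []
    for j in range(n):
        shifted = list(x)
        shifted[j] += delta
        xi.append((f(shifted, sample_omega()) - base) / delta)
    return xi


def random_direction_estimate(f, x, delta, r, sample_omega, rng):
    """Estimate (6.53): r differences along directions uniform on [-1, 1]^n."""
    n = len(x)
    base = f(x, sample_omega())
    xi = [0.0] * n
    for _ in range(r):
        h = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        shifted = [x[j] + delta * h[j] for j in range(n)]
        slope = (f(shifted, sample_omega()) - base) / delta
        for j in range(n):
            xi[j] += 3.0 * slope * h[j] / r
    return xi
```

The first estimate needs $n + 1$ function evaluations per iteration, whereas the second needs $r_s + 1$, which may be much smaller than $n$. Note that when independent samples of $\omega$ are used in the two terms of each difference, the variance of the estimate grows as $\Delta_s$ decreases; using a common sample, as permitted above, avoids this.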
For nondifferentiable functions $F^\nu(x)$ one can typically take $\xi^\nu(s)$ equal to a subgradient of $f^\nu(x, \omega)$ at $x = x^s$:

$$\xi^\nu(s) = f_x^\nu(x^s, \omega^s),$$

where $\omega^s$ is a sample of $\omega$ independent of $B_s$; more generally, similarly to (6.51),

$$\xi^\nu(s) = \frac{1}{N_s} \sum_{k=1}^{N_s} f_x^\nu(x^s, \omega^{sk}), \tag{6.54}$$

since under appropriate integrability conditions and by the definition of the subgradient set, we have

$$\partial F^\nu(x) = \int \partial f^\nu(x, \omega)\, P(d\omega).$$


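For the recourse and minimax problems referred to in Sections 6.8 and 6.9, such rules were first used in [2], [3], [5]. A general framework is provided by the results of papers [86], [87], [92]. If the functions $F^\nu(x)$ satisfy a local Lipschitz condition, then formulas (6.52), (6.53) can be modified, respectively, as

$$\xi^\nu(s) = \sum_{j=1}^{n} \frac{f^\nu(\tilde{x}^s + \Delta_s e^j, \omega^{sj}) - f^\nu(\tilde{x}^s, \omega^{s0})}{\Delta_s}\, e^j, \tag{6.55}$$

$$\xi^\nu(s) = \frac{3}{r_s} \sum_{k=1}^{r_s} \frac{f^\nu(\tilde{x}^s + \Delta_s h^{sk}, \omega^{sk}) - f^\nu(\tilde{x}^s, \omega^{s0})}{\Delta_s}\, h^{sk}, \tag{6.56}$$

where $\tilde{x}^s = (x_1^s + \tau_1^s, \ldots, x_n^s + \tau_n^s)$ and the random vector $\tau^s = (\tau_1^s, \ldots, \tau_n^s)$ is independent of $B_s$, with components uniformly distributed on $[-\gamma_s/2, \gamma_s/2]$.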
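It is easy to see that in both cases (6.55) and (6.56)

$$E\{\xi^\nu(s) \mid x^s\} = F_x^\nu(x^s, s) + b^\nu(s),$$

where $\|b^\nu(s)\| \le \text{const} \cdot \Delta_s/\gamma_s$ and $F_x^\nu(x, s)$ is the gradient of the differentiable function

$$F^\nu(x, s) = E F^\nu(x + \tau^s) = \frac{1}{\gamma_s^n} \int_{-\gamma_s/2}^{\gamma_s/2} \cdots \int_{-\gamma_s/2}^{\gamma_s/2} F^\nu(x + y)\, dy$$

with the property

$$\min\{\|F_x^\nu(x, s) - z\| \mid z \in \partial F^\nu(x)\} \to 0 \quad \text{for } \gamma_s \to 0.$$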
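The vector (see [9], [10])

$$\xi^\nu(s) = \sum_{j=1}^{n} \frac{f^\nu(\tilde{x}_1^s, \ldots, x_j^s + \gamma_s/2, \ldots, \tilde{x}_n^s, \omega^s) - f^\nu(\tilde{x}_1^s, \ldots, x_j^s - \gamma_s/2, \ldots, \tilde{x}_n^s, \omega^s)}{\gamma_s}\, e^j \tag{6.57}$$

is an unbiased estimate of the gradient $F_x^\nu(x^s, s)$:

$$E\{\xi^\nu(s) \mid x^s\} = F_x^\nu(x^s, s).$$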

Averaging operations (see Section 6.2) give a new opportunity to build a wide range of estimates $\bar{F}^\nu(s)$, $\bar{F}_x^\nu(s)$ from known ones, defined, for instance, through (6.51)-(6.56):

$$\bar{F}^\nu(s+1) = \bar{F}^\nu(s) + \psi_s[\eta_\nu(s+1) - \bar{F}^\nu(s)],$$
$$\bar{F}_x^\nu(s+1) = \bar{F}_x^\nu(s) + \delta_s[\xi^\nu(s+1) - \bar{F}_x^\nu(s)].$$

Consider now more concrete classes of estimates for some particular classes of problems.


6.8 The Stochastic Minimax Problems 


The objective function of the simplest stochastic minimax problem (see [3], [5], [18], [$2]) takes the form

$$F^0(x) = E \max_{i} \left[\sum_{j=1}^{n} a_{ij}(\omega) x_j + b_i(\omega)\right]. \tag{6.58}$$

Many inventory models have objective functions of this type. Consider a simple example.
In a store of capacity $r$ it is necessary to create a stock $x$ of a homogeneous product whose demand is characterized by the random variable $\theta$. The cost associated with the stock $x$, on the condition that the demand is equal to $\theta$, is characterized by the function

$$f^0(x, \omega) = \begin{cases} \alpha(x - \theta), & x \ge \theta, \\ \beta(\theta - x), & x < \theta, \end{cases}$$

that is,

$$f^0(x, \omega) = \max\{\alpha(x - \theta), \beta(\theta - x)\},$$

where $\alpha$ is the unit storage cost and $\beta$ is the unit shortage cost. The decision about the stock size $x$ must be made before the information about the demand $\theta$ is available, and the minimization of the expected cost leads to the minimization of the function

$$F^0(x) = E \max\{\alpha(x - \theta), \beta(\theta - x)\} \tag{6.59}$$

subject to $0 \le x \le r$.
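For the function (6.59) and the more general

$$F^0(x) = E f^0(x, \omega) = E \max_{y \in Y} g(x, y, \omega) = E g(x, y(x, \omega), \omega)$$

a statistical estimate of the subgradient takes the form (under reasonable assumptions)

$$\xi^0(s) = g_x(x^s, y, \omega^s)\big|_{y = y(x^s, \omega^s)}, \tag{6.60}$$

where $g_x$ is a subgradient of $g(\cdot, y, \omega^s)$ at $x = x^s$.

To see that

$$E\{\xi^0(s) \mid x^s\} = F_x^0(x^s)$$

for a convex function $g(\cdot, y, \omega)$, we can write

$$g(x, y(x, \omega^s), \omega^s) - g(x^s, y(x^s, \omega^s), \omega^s) \ge g(x, y(x^s, \omega^s), \omega^s) - g(x^s, y(x^s, \omega^s), \omega^s)$$
$$\ge \langle g_x(x^s, y(x^s, \omega^s), \omega^s), x - x^s \rangle = \langle \xi^0(s), x - x^s \rangle.$$

Taking conditional expectation on both sides, we get

$$F^0(x) - F^0(x^s) \ge \langle E\{\xi^0(s) \mid x^s\}, x - x^s \rangle,$$

from which the assertion follows.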
Instead of $y(x^s, \omega^s)$ we can also use $y^s$ such that $y^s \in Y$ and

$$g(x^s, y(x^s, \omega^s), \omega^s) - g(x^s, y^s, \omega^s) \le \varepsilon_s,$$

where $\varepsilon_s \to 0$ as $s \to \infty$. It is easy to see that

$$\xi^0(s) = g_x(x^s, y, \omega^s)\big|_{y = y^s} \tag{6.61}$$

satisfies the condition (6.7). In (6.60), (6.61) we can also apply the approximations (6.52), (6.53), (6.55), (6.57) with $f^0(x, \omega) = \max_{y \in Y} g(x, y, \omega)$ for the gradient or subgradient $g_x$. According to (6.60), for the objective function (6.58) we obtain the following expression for $\xi^0(s) = (\xi_1^0(s), \ldots, \xi_n^0(s))$:

$$\xi_j^0(s) = a_{i_s j}(\omega^s), \quad j = 1:n, \tag{6.62}$$

where

$$i_s = \arg\max_{i} \left[\sum_{j=1}^{n} a_{ij}(\omega^s) x_j^s + b_i(\omega^s)\right].$$

It means that for the stock problem (6.59) the scalar

$$\xi^0(s) = \begin{cases} \alpha, & \text{if } x^s \ge \theta^s, \\ -\beta, & \text{if } x^s < \theta^s, \end{cases} \tag{6.63}$$

and we have the following simple version of the method (6.11).

Let $x^0$ be an arbitrary initial approximation and $x^s$ be the approximation obtained after the $s$-th iteration. A value $\theta^s$ is observed according to the distribution of the demand, for instance, through a Monte-Carlo simulation model. Since $X = [0, r]$, it is easy to perform the projection onto $X$ and get the new approximate solution

$$x^{s+1} = \max\{0, \min[r, x^s - \rho_s \xi^0(s)]\}, \quad s = 0, 1, \ldots, \tag{6.64}$$

with $\xi^0(s)$ computed according to (6.63).
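
The procedure (6.64) with the estimate (6.63) translates almost literally into code. In the sketch below the function name `sqg_stock`, the step rule $\rho_s = c/(s+1)$ with the constant `step`, and the demand sampler passed as `sample_demand` are assumptions made for illustration.

```python
import random


def sqg_stock(alpha, beta, r, sample_demand, n_iter=100000, step=50.0, x0=0.0):
    """Projected SQG iteration (6.64) for the stock problem (6.59)."""
    x = x0
    for s in range(n_iter):
        theta = sample_demand()                    # observed demand theta^s
        xi = alpha if x >= theta else -beta        # estimate (6.63)
        rho = step / (s + 1)                       # assumed step-size rule
        x = max(0.0, min(r, x - rho * xi))         # projection onto [0, r]
    return x
```

Only a sampler of the demand is required, not its distribution function. For a demand uniformly distributed on $[0, 100]$ with $\alpha = 1$, $\beta = 3$, the iterates should settle near the point where the distribution function equals $\beta/(\alpha + \beta) = 0.75$, that is, near $x = 75$.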
The usual approach to the solution of the problem (6.59) is the following. It is easily seen that

$$F^0(x) = \alpha \int_0^x (x - \theta)\, dH(\theta) + \beta \int_x^{\infty} (\theta - x)\, dH(\theta),$$

and if the distribution $H$ has a density (the distribution is absolutely continuous), then the function $F^0$ is continuously differentiable. Then the solution is the point of the interval $[0, r]$ nearest to the point satisfying the equation $F_x^0(x) = 0$, which is equivalent to

$$H(x) = \frac{\beta}{\alpha + \beta}.$$

If there exists an algorithm for calculating $H(x)$, then the solution of this equation presents no difficulties.

In applying the method (6.64), neither the differentiability of $F^0(x)$ nor the existence of a density is required. The distribution may also be given implicitly. The method requires only observations $\theta^0, \theta^1, \ldots, \theta^s, \ldots$, and this feature makes (6.64)-type methods applicable in cases when only a Monte-Carlo procedure is available to simulate a possible demand. Consider the following problem, which is also discussed in Chapter 21.
Suppose that we have to determine the amounts $x_i$ of materials, facilities, etc., required at points $i = 1:n$ in order to meet a demand

$$\theta_i = \sum_{k=1}^{\ell} \varepsilon_{ki},$$

where $\varepsilon_{ki}$ is the random flow of users from the residence point $k = 1:\ell$ to the demand point $i = 1:n$. The users of residence point $k$ choose the point $i$ with given probability $p_{ki}$, $k = 1:\ell$, $i = 1:n$, and there are also the relations

$$\sum_{i=1}^{n} \varepsilon_{ki} = b_k, \quad k = 1:\ell,$$

where $b_k$ is a random quantity with known distribution function. The problem is to determine the sizes $x_i$ in order to minimize the cost function

$$F^0(x_1, \ldots, x_n) = E \sum_{i=1}^{n} \max\{\alpha_i(x_i - \theta_i), \beta_i(\theta_i - x_i)\}$$

subject to $0 \le x_i \le r_i$, $i = 1:n$.
The algorithm (6.11) with $\xi^0(s)$ as in (6.62) takes a form similar to (6.64):

$$x_i^{s+1} = \max\{0, \min[r_i, x_i^s - \rho_s \xi_i^0(s)]\}, \tag{6.65}$$

$$\xi_i^0(s) = \begin{cases} \alpha_i, & \text{if } x_i^s \ge \theta_i^s, \\ -\beta_i, & \text{if } x_i^s < \theta_i^s, \end{cases}$$

$$\theta_i^s = \sum_{k=1}^{\ell} \varepsilon_{ki}^s, \quad \sum_{i=1}^{n} \varepsilon_{ki}^s = b_k^s, \quad k = 1:\ell,\ i = 1:n,$$

where $b_k^s$, $\varepsilon_{ki}^s$, $\theta_i^s$ are observations of the number of users at point $k$, the flow variables, and the demand at point $i$, respectively.

We note again that for the procedure (6.65) the distribution of the demand $\theta_i$ need not be known: it is sufficient to have only a sequence of independent observations $\theta_i^0, \theta_i^1, \ldots, \theta_i^s, \ldots$ for each $i = 1:n$. This circumstance allows us to solve fairly general inventory control problems by SQG methods (see [5], [7]). In the problem discussed above, the distribution of the demands is hard to find.
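
A minimal simulation of the procedure (6.65) might look as follows. For simplicity the sketch assumes that the numbers of users $b_k$ are fixed integers rather than random quantities, and that each user chooses a demand point independently; the function names, the step rule $\rho_s = c/(s+1)$ and these simplifications are assumptions of the illustration only.

```python
import random


def sample_demands(b, p, rng):
    """Draw theta_i = sum_k eps_ki: each of the b[k] users of residence
    point k chooses demand point i independently with probability p[k][i]."""
    n = len(p[0])
    theta = [0] * n
    for k, users in enumerate(b):
        for _ in range(users):
            u, acc = rng.random(), 0.0
            for i in range(n):
                acc += p[k][i]
                if u < acc or i == n - 1:
                    theta[i] += 1
                    break
    return theta


def sqg_facilities(alpha, beta, r, b, p, n_iter=10000, c=10.0, seed=0):
    """Coordinate-wise projected iteration (6.65)."""
    rng = random.Random(seed)
    n = len(r)
    x = [0.0] * n
    for s in range(n_iter):
        theta = sample_demands(b, p, rng)          # simulated demands theta^s
        rho = c / (s + 1)
        for i in range(n):
            xi = alpha[i] if x[i] >= theta[i] else -beta[i]
            x[i] = max(0.0, min(r[i], x[i] - rho * xi))
    return x
```

The demands here are generated only through simulation of the individual flows, so their joint distribution never has to be written down, which is exactly the situation described above.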
6.9 Recourse Problems

One of the simplest recourse problems (see Chapter 1) may be formulated in the following way: find a vector $x \ge 0$ that minimizes the function

$$F^0(x) = \langle c, x \rangle + E \min\{\langle q, y \rangle \mid Tx + Wy \le h,\ y \ge 0\}, \tag{6.66}$$

where all elements of $T$, $W$, $h$, $q$ may be random variables. Here the decision $x$ is made in advance, before the observation of $\omega = (T, W, h, q)$; a corrective solution $y$ is derived from the known $\omega$ and $x$.

Consider a more general problem with the objective function

$$F^0(x) = E \min_{y} \{g^0(x, y, \omega) \mid g^i(x, y, \omega) \le 0,\ i = 1:m,\ y \in Y\}, \tag{6.67}$$

where $g^\nu(\cdot, \cdot, \omega)$, $\nu = 0:m$, are convex functions and $Y$ is a convex compact set.

Suppose that for each $(x, \omega)$ there is a feasible second-stage solution $y$ (we can always obtain it by introducing some additional variables) and a saddle point $(y(x, \omega), u(x, \omega))$ of the Lagrange function

$$g^0(x, \cdot, \omega) + \sum_{i=1}^{m} u_i g^i(x, \cdot, \omega),$$

where $y(x, \omega)$ is a second-stage solution. Then an estimate of a subgradient of the function (6.67) takes the form

$$\xi^0(s) = \left[g_x^0(x^s, y, \omega^s) + \sum_{i=1}^{m} u_i(x^s, \omega^s)\, g_x^i(x^s, y, \omega^s)\right]_{y = y(x^s, \omega^s)}. \tag{6.68}$$


1=1 
Let us show that (under reasonable assumptions of measurability and integrability) for the vector (6.68)

$$F^0(x) - F^0(x^s) \ge \langle E\{\xi^0(s) \mid x^s\}, x - x^s \rangle.$$

We have

$$g^0(x, y(x, \omega), \omega) = g^0(x, y(x, \omega), \omega) + \sum_{i=1}^{m} u_i(x, \omega)\, g^i(x, y(x, \omega), \omega)$$

for all $(x, \omega)$. Let us denote $q(x, \omega) = g^0(x, y(x, \omega), \omega)$. Then, taking into account the last relation, we have

$$q(x, \omega^s) - q(x^s, \omega^s) \ge g^0(x, y(x, \omega^s), \omega^s) - g^0(x^s, y(x^s, \omega^s), \omega^s)$$
$$+ \sum_{i=1}^{m} u_i(x^s, \omega^s)\,[g^i(x, y(x, \omega^s), \omega^s) - g^i(x^s, y(x^s, \omega^s), \omega^s)]$$
$$\ge \Big\langle g_x^0(x^s, y(x^s, \omega^s), \omega^s) + \sum_{i=1}^{m} u_i(x^s, \omega^s)\, g_x^i(x^s, y(x^s, \omega^s), \omega^s),\ x - x^s \Big\rangle$$
$$+ \Big\langle g_y^0(x^s, y(x^s, \omega^s), \omega^s) + \sum_{i=1}^{m} u_i(x^s, \omega^s)\, g_y^i(x^s, y(x^s, \omega^s), \omega^s),\ y(x, \omega^s) - y(x^s, \omega^s) \Big\rangle.$$

Since $y(x^s, \omega^s)$ minimizes the Lagrange function, we get

$$q(x, \omega^s) - q(x^s, \omega^s) \ge \Big\langle g_x^0(x^s, y(x^s, \omega^s), \omega^s) + \sum_{i=1}^{m} u_i(x^s, \omega^s)\, g_x^i(x^s, y(x^s, \omega^s), \omega^s),\ x - x^s \Big\rangle.$$

The assertion now follows by taking conditional expectation on both sides of this inequality.
From formula (6.68), for the function (6.66) we get the estimate

$$\xi^0(s) = c + u(x^s, \omega^s)\, T(\omega^s), \tag{6.69}$$

where $\omega^0, \ldots, \omega^s, \ldots$ are mutually independent samples of $\omega$ and $u(x^s, \omega^s)$ are the dual variables corresponding to a second-stage optimal plan $y(x^s, \omega^s)$. From formula (6.69) and the convergence of the procedure given by (6.11) we obtain the following method for solving a recourse problem.

(i) For given $x^s$ observe a random realization of $h$, $q$, $T$, $W$, which we denote by $h(s)$, $q(s)$, $T(s)$, $W(s)$;

(ii) solve the problem

$$\langle q(s), y \rangle \to \min,$$
$$W(s)\, y \le h(s) - T(s)\, x^s,$$
$$y \ge 0,$$

and calculate the dual variables $u(x^s, \omega^s)$;

(iii) set

$$\xi^0(s) = c + u(x^s, \omega^s)\, T(s)$$

and change $x^s$:

$$x^{s+1} = \max[0, x^s - \rho_s \xi^0(s)]. \tag{6.70}$$
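
Steps (i)-(iii) can be sketched for a one-dimensional special case in which the second-stage problem is solved in closed form, so that no linear programming solver is needed. The assumed data are $T = -1$, $W = -1$, $h = -d$ with a random demand $d$, so that the constraint $Tx + Wy \le h$ reads $y \ge d - x$; the multiplier of this constraint is taken nonnegative and equals $q$ when the constraint is active. The function name, the step rule and this choice of data are hypothetical.

```python
import random


def sqg_simple_recourse(c, q, sample_demand, n_iter=100000, step=50.0, x0=0.0):
    """SQG iteration (6.70) for F(x) = c*x + E min{q*y : y >= d - x, y >= 0}."""
    x = x0
    T = -1.0                                   # first-stage coefficient in T*x + W*y <= h
    for s in range(n_iter):
        d = sample_demand()                    # (i) observe the random data
        u = q if d - x > 0 else 0.0            # (ii) dual variable of the second stage
        xi = c + u * T                         # (iii) estimate (6.69)
        x = max(0.0, x - (step / (s + 1)) * xi)
    return x
```

With $c = 1$, $q = 4$ and $d$ uniform on $[0, 100]$, the objective is $x + 4E\max\{0, d - x\}$ and its minimizer satisfies $P(d \le x) = 1 - c/q$, that is, $x = 75$; the iterates should approach this value.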


It is worthwhile to note that this method can be regarded as a stochastic iterative procedure for the decomposition of large-scale problems. For instance, if $\omega$ has a discrete distribution, i.e., $\omega \in \{1, 2, \ldots, N\}$ and $\omega = k$ with probability $p_k$, then the recourse problem (6.66) is equivalent to the problem:

$$\langle c, x \rangle + p_1 \langle q(1), y(1) \rangle + p_2 \langle q(2), y(2) \rangle + \ldots + p_N \langle q(N), y(N) \rangle \to \min$$
$$T(1)x + W(1)y(1) \le h(1)$$
$$T(2)x + W(2)y(2) \le h(2)$$
$$\ldots$$
$$T(N)x + W(N)y(N) \le h(N)$$
$$x \ge 0,\ y(1) \ge 0,\ y(2) \ge 0,\ \ldots,\ y(N) \ge 0,$$

where $y(k)$ is the correction of the plan $x$ if $\omega = k$. The number $N$ may be very large. If only the vector $h = (h_1, \ldots, h_m)$ is random and each of its components has two independent outcomes, then $N = 2^m$. The SQG procedure (6.70) then allows us to solve extremely large-scale problems.

The formula (6.69) is also applicable to the dynamic version (see Chapter 1) of the problem (6.66): find a sequence $x = (x(0), x(1), \ldots, x(T))$ minimizing the function

$$F^0(x) = \sum_{t=0}^{T} \Big[\langle c(t), x(t) \rangle + E \min_{y(t)} \Big\{\langle d(t), y(t) \rangle \;\Big|\; \sum_{k=0}^{t} A_{tk}\, x(k) + B_t\, y(t) \le h(t),\ y(t) \ge 0\Big\}\Big] \tag{6.71}$$

subject to $x(t) \ge 0$, $t = 0:T$.

The estimate takes the form:

(i) For given $x^s = (x^s(0), x^s(1), \ldots, x^s(T))$ observe a random realization of $d(t)$, $h(t)$, $A_{tk}$, $B_t$ for $k = 0:t$, $t = 0:T$, which we denote by $d^s(t)$, $h^s(t)$, $A_{tk}^s$, $B_t^s$.

(ii) Solve the problem

$$\langle d^s(t), y(t) \rangle \to \min,$$
$$\sum_{k=0}^{t} A_{tk}^s\, x^s(k) + B_t^s\, y(t) \le h^s(t),$$
$$y(t) \ge 0,$$

for $t = 0:T$ and calculate the dual variables $u^s(t)$ corresponding to an optimal solution $y^s(t)$.

(iii) Calculate

$$\xi^{0k}(s) = c(k) + \sum_{t=k}^{T} u^s(t)\, A_{tk}^s, \quad k = 0:T.$$

The vector $\xi^0(s) = (\xi^{00}(s), \ldots, \xi^{0T}(s))$ is an estimate of a subgradient $F_x^0(x^s)$ (according to the rule (6.69)).

Therefore the method (6.70) applied to the problem (6.71) takes the form: in addition to (i)-(iii), change $x^s$ according to the formula

$$x^{s+1}(t) = \max[0, x^s(t) - \rho_s \xi^{0t}(s)], \quad t = 0:T,$$

and repeat (i)-(iii) with $x^{s+1} = (x^{s+1}(0), x^{s+1}(1), \ldots, x^{s+1}(T))$, etc.

The general formula (6.68) as well as (6.61) can also be modified according to all the universal rules discussed in Section 6.7.




6.10 Stochastic Problems with Composite Functions 


Until now we have discussed solution procedures for the problem (6.48)-(6.50) assuming that we know exact values of the random functions $f^\nu(x, \omega)$, $\nu = 0:m$, for fixed $x$, $\omega$. Meanwhile there are important problems in which these values are not known: problems with so-called composite (objective, constraint) functions $F^\nu$ of the following structure

$$F^\nu(x) = E f^\nu(x, \omega), \quad \nu = 0:m,$$
$$f^\nu(x, \omega) = q^\nu(E g^1(x, \omega), \ldots, E g^\ell(x, \omega), x, \omega), \tag{6.72}$$

where some of the functions $g^1, \ldots, g^\ell$ may themselves have the same type of structure, etc.

The penalty functions of the problem (6.48)-(6.50), for instance

$$E f^0(x, \omega) + C \sum_{i=1}^{m} \max\{0, E f^i(x, \omega)\},$$

are examples of such objective functions.
The moments

$$E \prod_{k=1}^{\ell} [g^k(x, \omega) - E g^k(x, \omega)]^{r_k}, \quad r_k \ge 0,$$

are also functions of this type, where $g^1(x, \omega), \ldots, g^\ell(x, \omega)$ are given random functions.

For composite functions there are difficulties with computing estimates of the values $F^\nu(x^s)$. The averaging operation allows us to overcome these difficulties in a way similar to the procedure (6.16), (6.17). Consider an illustrative example (see [5], pp. 201-215 for more details).

Let us suppose that there are mathematical expectations of only two levels, in the functions $q^\nu$ and $F^\nu$; that is, let us suppose that the values $g^1(x, \omega), \ldots, g^\ell(x, \omega)$ are calculated exactly for each $(x, \omega)$, and let $\{x^s\}_{s=0}^{\infty}$ be a bounded sequence of approximate solutions. Define the estimates $\bar{g}(s)$ by the formula

$$\bar{g}(s+1) = \bar{g}(s) + \delta_s[g(x^{s+1}, \omega^{s+1}) - \bar{g}(s)], \quad s = 0, 1, \ldots, \tag{6.73}$$

where $g(x^s, \omega^s) = (g^1(x^s, \omega^s), \ldots, g^\ell(x^s, \omega^s))$. According to the general result (6.23), under rather general requirements, with probability 1

$$\|\bar{g}(s) - E\{g(x^s, \omega^s) \mid x^s\}\| \to 0 \quad \text{for } s \to \infty.$$

Therefore, $\bar{g}(s)$ is an estimate of $E g(x^s, \omega)$, and $q^\nu(\bar{g}(s), x^s, \omega)$ can be used as an estimate of the value $f^\nu(x^s, \omega)$.
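
The effect of the averaging (6.73) can be seen on a toy composite problem. Assume, purely for illustration, a single inner function $g(x, \omega) = \omega x - 1$ with $\omega$ uniform on $[0, 2]$, the outer function $q(g) = g^2$, and $X = [0, 5]$, so that $F(x) = (E g(x, \omega))^2 = (x - 1)^2$ has the minimizer $x = 1$. The sketch compares the direction $2\bar{g}(s)\,\omega^s$ built from the averaged estimate with the naive direction $2 g(x^s, \omega^s)\,\omega^s$; the function name and the step rules are assumptions.

```python
import random


def composite_sqg(n_iter=50000, seed=0, averaged=True):
    """Toy composite problem F(x) = (E[w*x - 1])^2, w ~ U[0, 2], X = [0, 5]."""
    rng = random.Random(seed)
    x = 3.0
    w = rng.uniform(0.0, 2.0)
    g_bar = w * x - 1.0                       # g_bar(0) = g(x^0, w^0)
    for s in range(n_iter):
        rho = 1.0 / (s + 10)                  # step size rho_s
        delta = 1.0 / (s + 1) ** 0.6          # averaging weight, rho_s/delta_s -> 0
        inner = g_bar if averaged else w * x - 1.0
        xi = 2.0 * inner * w                  # q'(g) * g_x with q(g) = g^2, g_x = w
        x = max(0.0, min(5.0, x - rho * xi))
        w = rng.uniform(0.0, 2.0)             # next observation of w
        g_bar += delta * ((w * x - 1.0) - g_bar)   # averaging step (6.73)
    return x
```

With averaging the iterates should approach $x = 1$. The naive variant instead minimizes $E(\omega x - 1)^2$ and drifts toward $E\omega / E\omega^2 = 0.75$, which shows why exact values of $f^\nu$ cannot simply be replaced by single observations of the inner functions.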
Assume that

(i) $q^\nu(\cdot, \cdot, \omega)$, $g^k(\cdot, \omega)$, $\nu = 0:m$, $k = 1:\ell$, are continuously differentiable functions for all $\omega \in \Omega$;

(ii) the values $q_x^\nu(g, x, \omega)$, $q_{g_k}^\nu(g, x, \omega)$, $g_x^k(x, \omega)$, $\nu = 0:m$, $k = 1:\ell$, are calculated exactly.

Then the vectors

$$\xi^\nu(s) = q_x^\nu(\bar{g}(s), x^s, \omega^s) + \sum_{k=1}^{\ell} q_{g_k}^\nu(\bar{g}(s), x^s, \omega^s)\, g_x^k(x^s, \omega^s)$$

can be used as estimates of $F_x^\nu(x^s)$ in the different types of solution procedures discussed in Sections 6.2-6.3. There may also be modifications with finite-difference approximations of the gradients involved in $\xi^\nu(s)$, and generalizations to the case of nondifferentiable functions.


6.11 Problems of Optimal Control 


From the discussion in Chapter 1 it follows that rather general problems of optimal control with discrete time can also be viewed as problems of the type (6.48)-(6.50) with implicitly given objective and constraint functions. We consider only discrete-time problems here. Continuous-time problems are very often only mathematical approximations of really discrete-time problems, obtained in the hope of simplifying analytical formulas. From the computational point of view they have to be approximated by discrete-time analogues again. Under natural assumptions, an optimal control law of a continuous-time problem (without "pathologies") can be approximated (in terms of the objective function) by an optimal control law of the discrete-time analogue (see [$3], [29]).

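Suppose we are interested only in the time values $t = 0, 1, 2, \ldots, T$, and the variables $\bar{x} = (x(0), x(1), \ldots, x(T-1))$, $\bar{z} = (z(0), z(1), \ldots, z(T))$ represent the control actions and the state of the system over a given time horizon, respectively. The problem is to find a control $x(t)$, $t = 0:T-1$, as a function of $t$ which minimizes the objective function

$$F^0(\bar{x}) = E \gamma^0(z(0), \ldots, z(T), x(0), \ldots, x(T-1), \omega) \tag{6.74}$$

subject to

$$F^i(\bar{x}) = E \gamma^i(z(0), \ldots, z(T), x(0), \ldots, x(T-1), \omega) \le 0, \quad i = 1:m, \tag{6.75}$$

where the variables $\bar{x}$, $\bar{z}$ are connected by the system of stochastic equations

$$z(t+1) = g(t, z(t), x(t), \omega), \quad z(0) = z^0, \quad t = 0, 1, \ldots, T-1, \tag{6.76}$$

with the vector function $g = (g^1, \ldots, g^r)$. Suppose that $g^k(t, z, x, \omega)$, $k = 1:r$, are continuously differentiable functions with respect to $(z, x)$ for $t = 0:T-1$, $\omega \in \Omega$, and let $(\gamma_{\bar{z}}^\nu, \gamma_{\bar{x}}^\nu)(\bar{z}, \bar{x}, \omega)$ be a subgradient of $\gamma^\nu(\bar{z}, \bar{x}, \omega)$. According to the equation (6.76), $\bar{z} = (z(0), \ldots, z(T))$ is an implicit function of $\bar{x} = (x(0), \ldots, x(T-1))$; therefore,

$$\gamma^\nu(\bar{z}, \bar{x}, \omega) = f^\nu(\bar{x}, \omega).$$

If we denote the components of the vectors $\gamma_{\bar{z}}^\nu$, $\gamma_{\bar{x}}^\nu$, $f_{\bar{x}}^\nu(\bar{x}, \omega)$ as

$$\gamma_{\bar{z}}^\nu = (\gamma_{z(0)}^\nu, \ldots, \gamma_{z(T)}^\nu), \quad \gamma_{\bar{x}}^\nu = (\gamma_{x(0)}^\nu, \ldots, \gamma_{x(T-1)}^\nu),$$
$$f_{\bar{x}}^\nu = (f_{x(0)}^\nu, \ldots, f_{x(T-1)}^\nu),$$

then in rather general cases (see [5], [29], [$3])

$$f_{x(t)}^\nu = \gamma_{x(t)}^\nu(\bar{z}, \bar{x}, \omega) - \lambda(t+1)\, g_x(t, z(t), x(t), \omega), \tag{6.77}$$

where

$$\lambda(t) = \lambda(t+1)\, g_z(t, z(t), x(t), \omega) - \gamma_{z(t)}^\nu(\bar{z}, \bar{x}, \omega), \tag{6.78}$$
$$\lambda(T) = -\gamma_{z(T)}^\nu(\bar{z}, \bar{x}, \omega), \quad t = T-1, T-2, \ldots, 0.$$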

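For instance, if, in addition, the vectors $\gamma_{x(t)}^\nu(\bar{z}, \bar{x}, \omega)$, $\gamma_{z(t)}^\nu(\bar{z}, \bar{x}, \omega)$, $g^k(t, z, x, \omega)$, $g_z^k(t, z, x, \omega)$, $g_x^k(t, z, x, \omega)$, $\nu = 0:m$, $k = 1:r$, are bounded on any bounded set of $(z, x)$, and the functions $\gamma^\nu(\bar{z}, \bar{x}, \omega)$ are weakly convex with respect to $(\bar{z}, \bar{x})$, then the functions $F^\nu(\bar{x})$ are weakly convex and their subgradients satisfy

$$F_{\bar{x}}^\nu(\bar{x}) = E f_{\bar{x}}^\nu(\bar{x}, \omega).$$

Therefore, an estimate of the subgradient $F_{\bar{x}}^\nu$ is

$$\xi^\nu(s) = (f_{x(0)}^\nu(\bar{z}^s, \bar{x}^s, \omega^s), \ldots, f_{x(T-1)}^\nu(\bar{z}^s, \bar{x}^s, \omega^s)),$$

where $\bar{x}^s$ is the current approximate solution, $\omega^s$ is an observation of $\omega$ independent of $B_s$, and $\bar{z}^s$ is the solution of the equation (6.76) for given $\omega = \omega^s$, $\bar{x} = \bar{x}^s$. Again, instead of the exact gradients $g_z^k$, $g_x^k$ and subgradients $\gamma_{z(t)}^\nu$, $\gamma_{x(t)}^\nu$, their finite-difference approximations might be applied (see Section 6.7).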
Consider a more concrete example of problem (6.74)-(6.76). Suppose the system's equations are linear,

$$g(t, z(t), x(t), \omega) = A(t, \omega) z(t) + B(t, \omega) x(t) + C(t, \omega), \tag{6.79}$$

and

$$\gamma^0(\bar{z}, \bar{x}, \omega) = \max_{0 \le t \le T} \|z(t) - z^*(t)\|^2,$$

where $z^*(t)$ is a given (observed or prescribed) trajectory. The problem is to find a vector $\bar{x}$ that minimizes the expected deviation from the trajectory $z^*(t)$, $t = 0:T$:

$$F^0(\bar{x}) = E \max_{0 \le t \le T} \|z(t) - z^*(t)\|^2. \tag{6.80}$$
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The system (6.78) takes on the form 


A(t) =A(E+ A(t) + fey (6.81) 


MT) =—figt=T—-1,..-,0. 


f,(t)° = 1 2eet —z*(t)), if \|z(t) — 2*(¢)|| = maxo<e<r|lz(t) — z* (e)], 


0, in other case, 
fort =0:T. A stochastic subgradient €°(¢) = (f2(0)> vey (7-1) 


fa) = —A(E + Blt). 


Therefore, the method (6.11) applied to this problem reduces to the following
computation. Suppose that $x(t) \in X(t)$, a convex compact set.

(i) Let $\bar x^{s}$ be the current approximate solution. Make an observation $\omega^{s}$ and find
the solution $z^{s}(t)$, $t = 0:T$, of the equations

$$z^{s}(t+1) = A(t,\omega^{s})z^{s}(t) + B(t,\omega^{s})x^{s}(t) + C(t,\omega^{s}), \quad z^{s}(0) = z^{0}, \quad t = 0:T-1.$$

(ii) Find the solution $\lambda^{s}(t)$, $t = T, T-1,\dots,0$, of the equations (6.81) with
$z(t) = z^{s}(t)$, $t = 0:T$, $\omega = \omega^{s}$.

(iii) Compute $\xi^{0}(s) = (f_{x}(0),\dots,f_{x}(T-1))$,

$$f_{x}(t) = -\lambda^{s}(t+1)B(t,\omega^{s}),$$

and the new approximate solution $\bar x^{s+1} = (x^{s+1}(0),\dots,x^{s+1}(T-1))$,
where

$$x^{s+1}(t) = \pi_{X(t)}\bigl[x^{s}(t) + \rho_{s}\lambda^{s}(t+1)B(t,\omega^{s})\bigr], \quad t = 0:T-1, \quad s = 0,1,\dots$$
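The forward pass, the backward (adjoint) pass and the projected step can be illustrated with a minimal Python sketch for a scalar state and a scalar control. Everything specific here is an assumption made for illustration and is not taken from the text: the coefficient $A$ is a random scalar drawn by a user-supplied `sample_a`, $B$ and $C$ are the constants `b` and `c`, $X(t)$ is the interval `[x_lo, x_hi]`, and the step size $\rho_s = 2/(s+20)$ is one arbitrary admissible choice. The adjoint recursion is written with the sign convention $\lambda(T) = -f_z(T)$, $\lambda(t) = \lambda(t+1)A - f_z(t)$.

```python
import random

def simulate(x, a, b, c, z0):
    """Forward pass: z(t+1) = a*z(t) + b*x(t) + c for t = 0..T-1."""
    z = [z0]
    for x_t in x:
        z.append(a * z[-1] + b * x_t + c)
    return z

def max_sq_deviation(z, z_star):
    """Objective sample: max over t of (z(t) - z*(t))^2."""
    return max((zt - st) ** 2 for zt, st in zip(z, z_star))

def sqg_tracking(z_star, z0, b, c, x_lo, x_hi, sample_a, iters=3000, seed=0):
    rng = random.Random(seed)
    T = len(z_star) - 1
    x = [min(x_hi, max(x_lo, 0.0))] * T        # arbitrary feasible start
    for s in range(iters):
        rho = 2.0 / (s + 20)                   # hypothetical step-size rule
        a = sample_a(rng)                      # observation of omega
        z = simulate(x, a, b, c, z0)           # step (i)
        # f_z(t) is nonzero only at one time index attaining the maximum
        dev = [abs(z[t] - z_star[t]) for t in range(T + 1)]
        t_max = max(range(T + 1), key=dev.__getitem__)
        f_z = [0.0] * (T + 1)
        f_z[t_max] = 2.0 * (z[t_max] - z_star[t_max])
        # step (ii): backward recursion for the adjoint variables
        lam = [0.0] * (T + 1)
        lam[T] = -f_z[T]
        for t in range(T - 1, -1, -1):
            lam[t] = lam[t + 1] * a - f_z[t]
        # step (iii): projected stochastic subgradient step
        x = [min(x_hi, max(x_lo, x[t] + rho * lam[t + 1] * b)) for t in range(T)]
    return x
```

With a target such as `z_star = [0, 1, 1, 1, 1, 1]` and `sample_a = lambda rng: 0.5 + rng.uniform(-0.1, 0.1)`, the returned controls stay inside the interval by construction, and the average of `max_sq_deviation` over fresh samples should fall below its value for the zero control.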
6.12 Optimization Involving a Preference Structure


Many complex decision problems involve multiple conflicting objectives. Generally,
we cannot optimize several objectives simultaneously; for instance, minimize
cost and at the same time maximize reliability. If we can find some function
(utility function) that combines all objectives into a scalar index of preferability,
then the problem of decision making can be put into the format of the standard
optimization problem: find $x \in X$ that optimizes the utility function.
Finding a utility function may be a very difficult problem, and it is often easier
to have a preference ordering (preference structure) among feasible solutions
$x \in X$ and to deal with this structure directly to get the preferred solution. This
ordering may be based on the decision maker's judgment or other rules. So let
us assume that the decision maker has a preference structure at different points
$x \in X$ and there exists an (unknown) utility function $U(x)$ such that

$$x' \sim x'' \iff U(x') = U(x''), \qquad x' \succ x'' \iff U(x') > U(x'').$$
Consider the procedure

$$x^{s+1} = \pi_{X}\bigl(x^{s} + \rho_{s}\xi^{0}(s)\bigr),$$

$$\xi^{0}(s) = \begin{cases} h^{s}, & \text{if } x^{s} + \Delta_{s}h^{s} \succeq x^{s},\\ -h^{s}, & \text{if } x^{s} + \Delta_{s}h^{s} \prec x^{s},\end{cases} \qquad \text{for some } \Delta_{s} > 0,$$

where $h^{0},h^{1},\dots,h^{s},\dots$ are the results of independent samples of the random
vector $h = (h_{1},\dots,h_{n})$ uniformly distributed over the unit sphere. It can be
shown [?] that

$$E\{\xi^{0}(s) \mid x^{s}\} = a\,\frac{U_{x}(x^{s})}{\|U_{x}(x^{s})\|}$$

for differentiable $U(x)$, where $a$ is a positive number. Therefore, the convergence
of this procedure follows from the general conditions of the procedure given
by (6.11) (with small corrections). A series of similar procedures for general
constrained problems with nondifferentiable utility functions was investigated
in [64].
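The random-direction procedure needs only pairwise comparisons, which the following Python sketch makes explicit. It is an illustration under stated assumptions, not the procedure of [64]: the decision maker is replaced by a hypothetical comparison function `prefers(a, b)`, $X$ is a box `[lower, upper]^n`, the probe length `delta` plays the role of $\Delta_s$ and is kept constant, and $\rho_s = (s+1)^{-3/4}$ is an arbitrary step-size choice.

```python
import math
import random

def random_unit_vector(n, rng):
    """Uniform direction on the unit sphere via normalized Gaussian samples."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def preference_search(prefers, x0, lower, upper, iters=3000, delta=1e-2, seed=0):
    """Move along +h or -h depending on which side the comparison oracle prefers."""
    rng = random.Random(seed)
    x = list(x0)
    for s in range(iters):
        rho = 1.0 / (s + 1) ** 0.75            # hypothetical step-size rule
        h = random_unit_vector(len(x), rng)
        trial = [xi + delta * hi for xi, hi in zip(x, h)]
        sign = 1.0 if prefers(trial, x) else -1.0
        # projection onto the box X = [lower, upper]^n
        x = [min(upper, max(lower, xi + rho * sign * hi)) for xi, hi in zip(x, h)]
    return x
```

In a test, `prefers` can be simulated from a hidden utility, for example `lambda a, b: utility(a) >= utility(b)`; the search itself never evaluates `utility` directly.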




6.13 Mathematical Statistics Problems and Stochastic Optimization


Many problems of mathematical statistics can be formulated as special
cases of stochastic optimization problems. Such an interpretation allows us to
bring ideas of mathematical statistics into stochastic optimization. At the same
time, it gives an opportunity to apply the developed optimization techniques in
mathematical statistics. Consider some possible applications of SQG methods
(see also [5], [28]). These methods allow us to construct iterative procedures
which can be performed on line and can use a priori information concerning
the structure of the system to improve the estimates.

Many problems of statistical estimation deal with estimating
the true value $x^{*}$ of unknown parameters from the elements of a sample
$\theta^{0},\theta^{1},\dots,\theta^{s},\dots$ assumed to have been drawn from a distribution function
$H(y,x^{*}) = P\{\theta < y\}$. There may be different formulations of optimization
problems for such estimation problems, depending on our knowledge
about $H(y,x^{*})$.

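(i) There is no information about $H(y,x^{*})$ except the sample $\theta^{0},\theta^{1},\dots,\theta^{s},\dots$
and $x^{*} = E\theta$. Therefore the problem is to estimate $x^{*}$, where

$$\theta^{s} = x^{*} + \varepsilon(s), \quad E\varepsilon(s) = 0, \quad s = 0,1,\dots$$

The required parameter $x^{*}$ minimizes the function

$$F^{0}(x) = E\|x - \theta\|^{2}, \tag{6.82}$$

because $x = E\theta$ satisfies the optimality conditions

$$F^{0}_{x_i}(x) = 2(x_i - E\theta_i) = 0, \quad i = 1:n.$$

If a priori knowledge about the unknown $x^{*}$ is introduced as $x^{*} \in X$, then
from (6.11) we could obtain the following iterative procedure for finding
$x^{*}$ (with $\xi^{0}(s) = 2(x^{s} - \theta^{s+1})$):

$$x^{s+1} = \pi_{X}\bigl(x^{s} - \rho_{s}(x^{s} - \theta^{s+1})\bigr), \quad s = 0,1,\dots \tag{6.83}$$

If $X = R^{n}$, $\rho_{s} = \frac{1}{s+1}$, then

$$x^{s+1} = x^{s} - \frac{1}{s+1}\bigl(x^{s} - \theta^{s+1}\bigr) = \frac{1}{s+1}\sum_{k=1}^{s+1}\theta^{k}. \tag{6.84}$$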


The estimate (6.84) is the sample mean. The advantages of the estimate
(6.83) when compared to (6.84) are:

(a) if $X \ne R^{n}$, then from (6.83) it follows that $x^{s} \in X$ for all $s = 0,1,\dots$,
whereas in (6.84) only $\lim x^{s} \in X$. Therefore the estimates from
(6.83) should be better for small samples.
(b) the possibility of choosing $\rho_{s}$ as a function of $(x^{0},\dots,x^{s})$ in order to
decrease the value of the objective function (6.82). This can be done by
adaptive (interactive or automatic) ways of choosing $\rho_{s}$, as
described in Section 6.6. It leads to different nonlinear estimates of
$x^{*}$, in contrast to the estimate (6.84), which is a linear function of the
observations.


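Problems of estimating the moments

$$E\theta^{l}, \quad E|\theta|^{l}, \quad E(\theta - E\theta)^{l}$$

may also be formulated as minimization problems

$$F^{1}(x) = E\|x - \theta^{l}\|^{2}, \quad F^{2}(x) = E\|x - |\theta|^{l}\|^{2}, \quad F^{3}(x) = E\|x - (\theta - E\theta)^{l}\|^{2},$$

where for the sake of simplicity we denote

$$\theta^{l} = (\theta_{1}^{l},\dots,\theta_{n}^{l}), \quad |\theta|^{l} = (|\theta_{1}|^{l},\dots,|\theta_{n}|^{l}), \quad (\theta - E\theta)^{l} = \bigl((\theta_{1} - E\theta_{1})^{l},\dots,(\theta_{n} - E\theta_{n})^{l}\bigr).$$

The stochastic gradients of these functions are:

$$\xi^{1}(s) = 2\bigl(x^{s} - (\theta^{s+1})^{l}\bigr), \quad \xi^{2}(s) = 2\bigl(x^{s} - |\theta^{s+1}|^{l}\bigr), \quad \xi^{3}(s) = 2\Bigl(x^{s} - \prod_{k=1}^{l}\bigl(\theta^{s+1} - \theta^{s+1+k}\bigr)\Bigr).$$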


(ii) Suppose now that we have the information

$$E\theta = V(x)\big|_{x = x^{*}},$$

where $V(x)$ is a given function and $x^{*}$ is an unknown vector. Then $x^{*}$
minimizes the function

$$E\|V(x) - \theta\|^{2}.$$

(iii) If we have information about the density $p(y,x^{*})$ of $H(y,x^{*})$ with respect to a measure
$\mu(dy)$, then it can be shown that $x^{*}$ maximizes the function

$$E\ln p(\theta,x) = \int \ln p(y,x)\,p(y,x^{*})\,\mu(dy).$$


These problems are reformulations of the well-known principles of least squares,
i.e., minimization of the function

$$\frac{1}{N}\sum_{k=1}^{N}\|V(x) - \theta^{k}\|^{2},$$

and maximum likelihood, i.e., maximization of the function

$$\frac{1}{N}\sum_{k=1}^{N}\ln p(\theta^{k},x).$$


It gives us a good opportunity to apply SQG methods. 
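Two of the estimation procedures above can be sketched in a few lines of Python. The first function is procedure (6.83) with $\rho_s = 1/(s+1)$ in the scalar case, where $X$ is taken to be an interval `[lo, hi]`. The second is a stochastic gradient ascent on $E\ln p(\theta,x)$ for the exponential density $p(y,x) = x e^{-xy}$. The exponential model, the interval bounds and the step rule $\rho_s = 4/(s+10)$ are illustrative assumptions, not taken from the text.

```python
import random

def clip(v, lo, hi):
    """Projection of a scalar onto the interval [lo, hi]."""
    return min(hi, max(lo, v))

def projected_mean(samples, lo, hi, x0=0.0):
    """Procedure (6.83) with rho_s = 1/(s+1); returns all iterates x^1, x^2, ..."""
    x, path = x0, []
    for s, theta in enumerate(samples):
        x = clip(x - (x - theta) / (s + 1), lo, hi)
        path.append(x)
    return path

def sqg_exponential_rate(samples, lo=0.1, hi=10.0, x0=1.0):
    """Stochastic gradient ascent on E ln p(theta, x), p(y, x) = x*exp(-x*y)."""
    x = x0
    for s, theta in enumerate(samples):
        rho = 4.0 / (s + 10)          # hypothetical step-size rule
        grad = 1.0 / x - theta        # d/dx [ln x - x*theta]
        x = clip(x + rho * grad, lo, hi)
    return x
```

With `lo = -inf` and `hi = +inf` the last iterate of `projected_mean` coincides with the sample mean (6.84); with a finite interval every iterate is feasible, which is advantage (a).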

The above mentioned problems are the problems of pure estimation. Very 
often the main reasons for estimation and identification are control or optimiza- 
tion. In some cases, the task of optimization and estimation can be separated 
and optimization is performed after estimation. However, in the problems of 
adaptation it is usually necessary to optimize and estimate simultaneously. For 
instance, optimization cannot be separated from estimation if the observation 
of unknown parameters depends on the current value of the control variables. 

The optimization task arising in such an environment requires the development of
new optimization techniques which have much in common with the minimization
of time-varying functions, i.e., nonstationary optimization (see Section 6.4).

Consider an illustrative example: minimization of the differentiable function

$$F^{0}(x) = \psi(x,z^{*}), \quad x \in R^{n},$$

where $z^{*}$ is a vector of unknown parameters. At each iteration $s = 0,1,\dots$,
an observation $\theta^{s}$ is available which has the form of a direct observation of the
parameter vector $z^{*}$:

$$E\theta^{s} = z^{*}, \quad s = 0,1,\dots \tag{6.85}$$

The problem is to create a sequence $\{x^{s}\}_{s=0}^{\infty}$ which converges to the set of
optimal solutions. Note that $F^{0}(x)$ cannot be optimized directly because of the
unknown parameters $z^{*}$. However, at iteration $s$ we could obtain a statistical
estimate $z^{s}$ such that $z^{s} \to z^{*}$ with probability 1, and a sequence of functions
$F^{0}(x,s) = \psi(x,z^{s})$ such that

$$F^{0}(x,s) \to F^{0}(x)$$

with probability 1 for $s \to \infty$.
Let us notice that at iteration $s$ only the function $F^{0}(x,s)$ is available.
Therefore we are led to the procedures of nonstationary optimization

$$x^{s+1} = x^{s} - \rho_{s}F^{0}_{x}(x^{s},s), \quad s = 0,1,\dots, \tag{6.86}$$

$$F^{0}(x,s) = \psi(x,z^{s}).$$


In the case of stochastic programming problems $z^{*}$ may correspond to the vector
of unknown parameters of the probability measure $P(\cdot,z^{*})$:

$$F^{0}(x) = \int f(x,\omega)\,P(d\omega,z^{*}).$$

If $\{z^{s}\}$ is a sequence of estimates with $z^{s} \to z^{*}$ with probability 1, then we are led to
procedures of the following type:

$$x^{s+1} = x^{s} - \rho_{s}\xi^{0}(s),$$

where $\xi^{0}(s)$ is an estimate of the gradient $F^{0}_{x}(x,s)$ at $x = x^{s}$,

$$F^{0}(x,s) = \int f(x,\omega)\,P(d\omega,z^{s}).$$

For instance, similar to Section 6.7,

$$\xi^{0}(s) = f_{x}(x^{s},\omega^{s}),$$

where $\omega^{s}$ is a sample of $\omega$, independent of $B_s$, drawn from the nonstationary
distribution $P(\cdot,z^{s})$. We can also use more complicated estimates
(similar to (6.52), (6.53)). More difficult problems arise when $\theta^{s}$, $s = 0,1,\dots$,
are not direct observations of the vector $z^{*}$, in other words, if, instead of the
relationship (6.85), we have the following (see [20], [75], [76]):

$$E\{\theta^{s} \mid x^{s}\} = \varphi(x^{s},z^{*}),$$

which may depend on the current approximate solution $x^{s}$. Since we do not
know $x^{s}$ in advance, a procedure of type (6.86) that directly solves the
optimization problem and simultaneously estimates $z^{*}$ is needed again.
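A procedure of type (6.86) with direct observations (6.85) can be sketched as follows for scalar $x$ and $z^{*}$. The running mean used as the estimate $z^{s}$, the step size $\rho_s = 1/(s+1)$, and the callables `observe` and `grad_psi` are assumptions introduced for this illustration only.

```python
import random

def adaptive_minimize(observe, grad_psi, x0, iters=5000, seed=0):
    """Optimize psi(x, z*) while z* is being estimated from observations."""
    rng = random.Random(seed)
    x, z_est = x0, 0.0
    for s in range(iters):
        theta = observe(rng)                  # direct observation, E theta = z*
        z_est += (theta - z_est) / (s + 1)    # running mean as the estimate z^s
        rho = 1.0 / (s + 1)                   # hypothetical step-size rule
        x -= rho * grad_psi(x, z_est)         # gradient step on psi(., z^s)
    return x, z_est
```

For example, with $\psi(x,z) = \tfrac12(x-z)^2 + \tfrac12 x^2$ the gradient is `lambda x, z: (x - z) + x` and the minimizer is $z^{*}/2$; the iterates approach it without $z^{*}$ ever being known exactly.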
CHAPTER 7


MULTIDIMENSIONAL INTEGRATION AND STOCHASTIC 
PROGRAMMING 


I. Deak 


Abstract 


A survey of well-known techniques and some recent results in multidimensional 
integration is presented together with a list of references. Methods are inves- 
tigated with emphasis on their applications in stochastic programming. Also 
some results are reported on the Monte Carlo computation of the distribution 
function and probabilities of rectangles in case of multinormal distribution. 


7.1. Introduction 


In several problems of stochastic programming the evaluation of some kind of 
n-dimensional integrals is required. Of course, multidimensional integration is 
necessary in many other fields, too. Generally when one takes more aspects of 
the problem into account at the same time and wants to obtain a kind of general 
assessment of the problem one is faced with multidimensional integration. 

There are some survey papers on multidimensional integration, e.g. Haber 
[24], Halton [25] and also there are the books Stroud [49], Ermakov [15] and 
that of Davis and Rabinowitz [5]. Especially this last one can be recommended 
for interested readers. Unfortunately no recent attempt to give a survey of 
the state of the art is known to the author (the survey paper of Niederreiter 
summarizes only quasi Monte Carlo methods). Since the subject of multidimen- 
sional integration is rapidly extending and no unique solution procedure can be 
judged at present to be the best, it is necessary to give at least an overview of 
the main streams at the moment. 

First we describe some problems in stochastic programming where evalua- 
tion of multidimensional integrals is required. In Section 7.3 general methods of 
multidimensional integration are discussed with emphasis on those applicable 
in higher dimensions. Here we point out the advantages and drawbacks of the 
methods. In Section 7.4 some general difficulties encountered in multidimen- 
sional integration are considered. 

In order to show the power of the Monte Carlo method we present some
results in computing the multinormal distribution function and probabilities
of rectangles in Section 7.5. Finally, the last section offers solution strategies,
advising a prospective user how to choose the method depending on the
problem and the number of dimensions.
7.2 Integration problems in stochastic programming


Problems of evaluating multidimensional integrals generally can be written as

$$\int_{D} g(x)\,v(x)\,dF(x), \tag{7.1}$$

where $D$ is a $d$-dimensional set, $g(x)$ is a function to be integrated, $v(x)$ is a
weight function (sometimes $g(x)$ or $v(x)$ or both of them are equal to 1), and $F(x)$ is the
distribution function of a probability distribution.

In probability-constrained models presented by Prékopa [44], distribution
function values and their gradients are needed. For example, in the following
STABIL model (see Prékopa et al. [45])

    minimize    $c'x$
    subject to  $Ax = b$,
                $x \ge 0$,
                $P\{Tx \ge \xi\} \ge p$,

where $\xi$ is a random vector with distribution function $F$, we need the evaluation
of the integral

$$F(t) = \int_{-\infty}^{t_1}\cdots\int_{-\infty}^{t_d} dF(x).$$

In some other cases, e.g. in the approximate solution strategy devised by Kall
[30] for the two-stage stochastic programming model, probabilities of rectangles
are used (i.e. $D$ is a rectangle, $g(x) = v(x) = 1$ in (7.1)).

The case when D is a simplicial cone is of interest, since fast computation of 
such integrals would make possible another solution procedure for the two-stage 
model as it was pointed out by Wets [55]. 

The two-stage stochastic programming problem is the following:

    minimize    $c'x + \Psi(x)$
    subject to  $\Psi(x) = E(Q(x,\xi))$,
                $Q(x,\xi) = \inf\{q'y \mid Wy = h - Tx,\ y \ge 0\}$,

where all components of $h$, $q$, $W$ and $T$ might be random variables, $\xi$ is the
random vector comprising all the random terms on the right hand side, and its
distribution function will be denoted by $\Phi$. Thus we have

$$\Psi(x) = \int Q(x,\xi)\,d\Phi(\xi).$$

The evaluation of this function $\Psi$ is no simple problem: the integrand
is a complicated function, and obtaining a single value of it requires solving a linear
programming problem. It is highly probable that direct integration cannot
be applied here; only a reformulation of the problem or the approximation schemes
suggested by Strazicky [48], Kall [30], Kall and Stoyan [31], Wets [56] could
overcome this stumbling block.
7.3 General Methods of Multidimensional Integration


There is a wide variety of methods that work well only in low dimensions $d = 2, 3, 4$.
We by no means deny the merits of these techniques, but from
the point of view of stochastic programming they seem to be of little value if
they cannot be generalized or applied in higher dimensions. Just to give some
examples of work in this category we mention some references. Donelly [13]
expanded and integrated the two-dimensional normal density function with a
very high degree of precision. Milton [38] and Dutt [14] suggested methods for
computing normal integrals up to six and four dimensions respectively. Similar
work for two-dimensional cases was published by Brown [8] and Drezner [18].
Integral formulas for the three-dimensional sphere were developed by Freeden
[19] and Lebedev [36]. See also the papers of Terras [58] and Tsuda [54].

In multidimensional integration an important role is played by the change 
of the order of integration, approximations to the integrand, the many different 
kinds of integrand transforms and composite integration rules. Since most of 
these methods seem to be specific ad hoc methods, in what follows we will 
focus on the general methods of multidimensional integration, especially on 
those applicable in higher dimensions (d > 5) . 


7.3.1 Product Rules
By an integration rule $R(x_i,w_i)$ we mean a set of points $x_i$, weights $w_i$, $i =
1,\dots,M$, and the approximation of the integral:

$$\int_{D} f(x)\,dx \approx \sum_{i=1}^{M} w_i f(x_i). \tag{7.2}$$

By a product rule we mean a product of two lower dimensional rules. More
precisely, assume that $D = B \times C$, where $\times$ denotes the Cartesian product,
$B \subset R^{d_1}$, $C \subset R^{d_2}$, $d_1 + d_2 = d$; furthermore we have an integration rule
$R_1(x_{1i},w_{1i})$ with $M_1$ points in $B$ and another rule $R_2(x_{2j},w_{2j})$ with $M_2$ points in $C$.
The product rule $R = R_1 \times R_2$ consists of the points $(x_{1i},x_{2j})$ with weights
$w_{1i}w_{2j}$, $i = 1,\dots,M_1$, $j = 1,\dots,M_2$, and gives the approximation

$$\sum_{i=1}^{M_1}\sum_{j=1}^{M_2} w_{1i}w_{2j}\,f(x_{1i},x_{2j}). \tag{7.3}$$

Application of product rules, especially if the same, say, one-dimensional rule is
applied repeatedly, is easy. However, problems arise when
$D$ cannot be decomposed into a Cartesian product, and also when the number
of points in the product rule grows so large that the application of product
rules is doomed to failure.
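As an illustration of (7.2) and (7.3), the following Python sketch builds a product rule on the unit cube by repeating a one-dimensional rule. The choice of the 2-point Gauss-Legendre rule on $[0,1]$ and the function names are assumptions made for this example.

```python
import itertools
import math

def gauss2_unit():
    """2-point Gauss-Legendre rule on [0, 1] as a list of (point, weight)."""
    h = 0.5 / math.sqrt(3.0)
    return [(0.5 - h, 0.5), (0.5 + h, 0.5)]

def product_rule(rule_1d, d):
    """d-fold product of a one-dimensional rule: M^d points with product weights."""
    rule = []
    for combo in itertools.product(rule_1d, repeat=d):
        point = tuple(p for p, _ in combo)
        weight = math.prod(w for _, w in combo)
        rule.append((point, weight))
    return rule

def integrate(f, rule):
    """Weighted sum (7.2) over the points of a rule."""
    return sum(w * f(x) for x, w in rule)
```

The 2-point rule integrates polynomials up to degree 3 exactly in each variable, so `integrate(lambda x: x[0]**2 * x[1]**3, product_rule(gauss2_unit(), 2))` reproduces $1/12$ up to rounding. The length of `product_rule(gauss2_unit(), d)` is $2^d$, which shows how quickly the number of points grows.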
7.3.2 Rules Exact for Monomials

These rules are developed for exactly integrating monomials of type $\prod_{j=1}^{d} x_j^{\alpha_j}$
up to a certain degree $k = \alpha_1 + \cdots + \alpha_d$. E.g. a rule exact with degree 2
can be determined by solving the following nonlinear system of equations for
$w_i$, $x_i$, $i = 1,\dots,M$:

Degree 0: $\sum_{i=1}^{M} w_i = \int_{D} dx$,

Degree 1: $\sum_{i=1}^{M} w_i x_{ij} = \int_{D} x_j\,dx$, $j = 1,\dots,d$,

Degree 2: $\sum_{i=1}^{M} w_i x_{ij}x_{il} = \int_{D} x_j x_l\,dx$, $j = 1,\dots,d$, $l = 1,\dots,d$.

The number $M$ of points necessary for integrating monomials of
degree $k$ cannot be determined explicitly as a function of $k$ and $d$, but according
to well-known theorems the inequality

$$\binom{d + [k/2]}{[k/2]} \le M \le \binom{d + k}{k} \tag{7.4}$$

holds, where $[\ ]$ means the integer part function. Generally the number of
equations to be solved may be quite large, though some work has been done by
Keast and Diaz (1983) in reducing the number of equations in a special case.


7.3.3 Quasi Monte Carlo Methods
These methods, contrary to what might be suggested by their name, use a
carefully selected, deterministic sequence of points. Such sequences do not look
like random sequences, and nobody asks us to believe that they are random. Some
papers also call these methods number-theoretical methods.

Consider a sequence of points $S = \{x_1,\dots,x_N\}$ in the unit cube $K =
\{x \mid 0 \le x \le 1\}$; then for the error of the approximating sum $\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ we
have

$$\Bigl|\int_{K} f(x)\,dx - \frac{1}{N}\sum_{i=1}^{N} f(x_i)\Bigr| \le D_N(S)\,V(f). \tag{7.5}$$

Here $D_N(S)$ denotes the discrepancy of the sequence $S$ and is given by

$$D_N(S) = \sup_{t \in K}\Bigl|\frac{\#\{i : x_i \in [0,t)\}}{N} - t_1 t_2\cdots t_d\Bigr|,$$

where $\#$ denotes the number of points in the set, and $V(f)$ is the $d$-dimensional
variation of the function $f$ in the sense of Hardy and Krause (see Zaremba
[59] and Niederreiter [41]). Since we can do little about $V(f)$ we try to select
sequences with small discrepancy. A sequence can be found for which

$$D_N(S) \le c_1\frac{(\log N)^{d-1}}{N}$$

with a constant $c_1$, but according to a result of Roth for any sequence $S$ we
have

$$D_N(S) \ge c_2\frac{(\log N)^{(d-1)/2}}{N},$$

so this limits our hopes of finding good sequences.

Research related to this field has been done by Zaremba [59], Sugihara 
and Murota [50], Cranley and Patterson [4]. For a comprehensive treatment 
the reader is referred to Niederreiter [41] which contains almost four hundred 
references. 

Similar, very closely connected research carried out by Korobov [34], Hlawka,
Zaremba [61], Niederreiter [39], [42], [43] is called the theory of good lattice
points (or optimal multipliers). This research consists of finding a vector
$a$ for which the error of approximation given by

$$\Bigl|\int_{K} f(x)\,dx - \frac{1}{N}\sum_{k=1}^{N} f\Bigl(\Bigl\{\frac{k}{N}a\Bigr\}\Bigr)\Bigr| \tag{7.6}$$

would be small, where $\{\ \}$ means the fractional part of the number.

The advantage of these methods is the fast convergence since the error is 
O(log N/N). There are several successful implementations in low dimensions 
(about 2 < d < 10) but in higher dimensions the method is likely to run into 
difficulties. 
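A lattice rule of the form (7.6) takes only a few lines of Python. The sketch below is generic in the generating vector `a`; the two-dimensional Fibonacci choice $N = 233$, $a = (1, 144)$ mentioned afterwards is a standard textbook example and an assumption of this illustration, not a recommendation from the text.

```python
import math

def lattice_rule(f, n, a):
    """Average of f over the lattice points {k*a/n}, k = 0, ..., n-1."""
    total = 0.0
    for k in range(n):
        # integer arithmetic keeps the fractional parts exact before division
        x = tuple(((k * aj) % n) / n for aj in a)
        total += f(x)
    return total / n

def periodic_test_function(x):
    """Smooth periodic integrand with exact integral 1 over the unit cube."""
    return math.prod(1.0 + math.cos(2.0 * math.pi * xj) for xj in x)
```

For the periodic test integrand the rule `lattice_rule(periodic_test_function, 233, (1, 144))` is exact up to rounding, which illustrates why such point sets are attractive for smooth periodic functions in low dimensions.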

For regions different from rectangles and for some simple functions the theory
has not yet been developed. E.g. consider the function $g(x,y) = 1$ if
$x < y$, otherwise $g(x,y) = 0$; this function has unbounded variation $V(g)$. In
connection with the discrepancies some research has been done by Braaten [3],
who defined a discrepancy measure invariant under reflections. Probably some
similar results would be needed for the variation $V(f)$.




7.3.4 Monte Carlo Methods


This is a kind of integration where one uses, in theory, random points; the
theoretical justifications hold for this case. In practice, points produced by
deterministic procedures that look random are used (sometimes they are called
pseudorandom, more frequently simply random). The essence and the main types of
Monte Carlo computations are elegantly described by Hammersley and Handscomb [26].

The integral is approximated by the estimator

$$\frac{1}{N}\sum_{i=1}^{N} f(\xi_i), \tag{7.7}$$

where $\xi_1,\dots,\xi_N$ are samples from the uniform distribution in $D$. The standard
deviation of (7.7) is $D(f(\xi))/\sqrt{N}$, where $D(f(\xi))$ denotes the standard deviation
of $f(\xi)$; this quantity is used as the error of the result
in Monte Carlo computation. Generally this error is quite large and thus one is
bound to use variance reduction techniques, i.e. to construct estimators having
less variance than the estimator (7.7).

Several such techniques have been devised, e.g. importance sampling, stratified
sampling, the method of control variates and that of antithetic variates.
Ermakov and Zolotukhin [16] proposed the expansion of the integrand into a
sum of orthogonal functions; this method was recently supplemented by Bogues
et al. [1] with details that make it computationally feasible.

As an interesting approach we mention the work of Yakowitz et al. [58],
who gave an estimator containing nonlinear combinations of the function values
and thus obtained convergence faster than $O(1/\sqrt{N})$, but only up to dimension
$d = 4$.

The implementation of the Monte Carlo method is easy and can be done
for almost every kind of function and integration domain (infinite ranges of
integration have to be truncated). The deviation of the estimator (the error)
can be computed with little additional effort and is sharp. Also note that
integrals in very high dimensions can be computed by the Monte Carlo method,
e.g. in Deak [7] an example in $d = 50$ dimensions was presented. The trouble
in Monte Carlo computations is with the accuracy, which usually covers two or
three digits only, and with the very slow convergence.
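The crude estimator (7.7) and its error estimate can be sketched in Python as follows, assuming the domain is the unit cube (so that no volume factor is needed). The function name and the use of the sample standard deviation as a stand-in for $D(f(\xi))$ are choices made for this illustration.

```python
import math
import random

def monte_carlo(f, d, n, seed=0):
    """Crude Monte Carlo estimate of the integral of f over [0, 1]^d.

    Returns (estimate, standard_error), where the standard error is the
    sample standard deviation of f divided by sqrt(n).
    """
    rng = random.Random(seed)
    s1 = s2 = 0.0
    for _ in range(n):
        v = f([rng.random() for _ in range(d)])
        s1 += v
        s2 += v * v
    mean = s1 / n
    var = max(0.0, (s2 - n * mean * mean) / (n - 1))
    return mean, math.sqrt(var / n)
```

For $f(x) = \sum_i x_i^2$ in $d = 10$ dimensions the exact value is $10/3$ and the standard deviation of $f$ is about $0.94$, so $N = 10\,000$ points give an error of roughly $0.01$, i.e. two to three correct digits, in line with the remark above.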
7.4 Difficulties in Multidimensional Integration


Compared to one-dimensional integration one encounters many more difficulties
in integrating $d$-dimensional functions.

First of all the variety of domains of integration should be observed. In the
one-dimensional case only the interval and the half-line have to be considered
as possible domains, while even in two dimensions we have an infinite number
of domains that cannot be transformed by affine transformations into each
other. In practice simple regions, like cube, sphere, cone, simplex, torus, etc.,
are selected. Sometimes there is a possibility of transforming one of them
into another (e.g. a sphere into a rectangle via polar transformation) but the
resulting clumsy function in most cases deters the user from applying it. Also
there is the possibility of subdividing a cube, say into simplices (see for example
the paper of Good and Gaskins [22]), but in most cases the number of resulting
subregions makes this procedure feasible in low dimensions only. Subdivision,
if any, may be done only in an adaptive or even in an interactive way, as can
be seen from the papers of Friedman and Wright [20] and Kahaner and Wells
[29].

Another problem is the so-called dimensional effect. In the application
of product formulas we have to deal with the following inconvenient phenomenon.
If we need $M$ points for integration in one dimension to achieve
a given accuracy (in the sense that polynomials of a given order can be
integrated exactly), then applying this rule repeatedly in $d$ dimensions we require
$M^{d}$ points. In the case of nonproduct formulas at least
$\binom{d + [k/2]}{[k/2]}$ points are required
for exactly integrating polynomials of degree $k$. This means that the
necessary number of points (amount of work) grows much faster than the number
of dimensions. One possibility to conquer the dimensional effect is to use
Monte Carlo methods in high dimensions.
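The dimensional effect is easy to tabulate. The short Python sketch below compares the point count $M^d$ of a repeated one-dimensional rule with the lower bound of (7.4) for nonproduct rules; the function names are introduced here for illustration only.

```python
from math import comb

def product_rule_points(m, d):
    """Number of points when an m-point one-dimensional rule is repeated d times."""
    return m ** d

def nonproduct_lower_bound(k, d):
    """Lower bound of (7.4): binomial(d + [k/2], [k/2]) points for degree k."""
    return comb(d + k // 2, k // 2)

if __name__ == "__main__":
    # a 3-point Gauss rule is exact up to degree 5 in one dimension
    for d in (2, 5, 10, 20):
        print(d, product_rule_points(3, d), nonproduct_lower_bound(5, d))
```

For $d = 10$ and degree $k = 5$ the product rule needs $3^{10} = 59049$ points while the lower bound for nonproduct rules is 66, and the gap widens rapidly with $d$.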

Generally the estimation of error is difficult; usually two rules are independently
applied and the difference between the two results is used as the error.
In this way, however, we are likely to overestimate the true error by orders of
magnitude. One may always resort to Monte Carlo methods; nevertheless a better
idea is proposed by Laurie [35], or the more general way of randomization of
deterministic methods of Cranley and Patterson [4] can be recommended. This
last one creates a family of rules by introducing a random parameter, and sampling
from this family enables the construction of confidence intervals for the
magnitude of error.

One should observe the use of optimization and mathematical programming
in the field of multidimensional integration, as in Mantel and
Rabinowitz [37], as well as Friedman and Wright [20]. Maybe this is the way
to make adaptive subdivision really practical?

Finally we mention that the theory of orthogonal polynomials, so fruitful 
in one dimension, does not carry over to the d-dimensional case, only some part 
of the whole can be saved (see Davis and Rabinowitz [5]). 
7.5 Computation of Multinormal Probabilities by Monte Carlo
Methods

Here we summarize some results on computing the distribution function $\Phi$ of
the multinormal distribution, that is the value

$$p = \Phi(h) = \int_{-\infty}^{h_1}\cdots\int_{-\infty}^{h_d}\varphi(x)\,dx, \tag{7.8}$$

where

$$\varphi(x) = \frac{1}{(2\pi)^{d/2}|R|^{1/2}}\exp\Bigl\{-\frac{1}{2}x'R^{-1}x\Bigr\}$$

is the density function of the $d$-dimensional normal distribution with expectation
0 and correlation matrix $R$; furthermore the probability $q$ of a rectangle
$Q = \{x \mid a \le x \le b\}$, that is the value

$$q = \int_{Q}\varphi(x)\,dx. \tag{7.9}$$

The main result on the evaluation of the value $p$ has been described in Deak [7],
while details on the computation of $q$ can be found in Deak [10]. However, the
main idea will be demonstrated here. Denote by $g(x)$ the characteristic function
of $Q$, that is $g(x) = 1$ if $x \in Q$, and $g(x) = 0$ otherwise. Let $\xi$ be a random
vector with density $\varphi$; it can be written as $\xi = \chi\eta$, where $\chi$ is a $\chi$-distributed
random variable with $d$ degrees of freedom (its distribution function is $F_d$) and $\eta$ is
uniformly distributed on the surface of the hyperellipsoid $E_d = \{x \mid x'R^{-1}x = 1\}$;
its distribution function will be denoted by $V(y)$. Using these notations we can
decompose (7.9) as

$$q = \int_{Q}\varphi(x)\,dx = \int g(x)\varphi(x)\,dx = \int_{E_d}\int_{0}^{\infty} g(ky)\,dF_d(k)\,dV(y). \tag{7.10}$$

Let $r_1$ and $r_2$ be the entry and exit constants of a vector $y$ with respect to the
domain $Q$, that is $ry \in Q$ holds if $0 \le r_1 \le r \le r_2$. Then define the function
$e$ as the probability content of the line $y$ as follows:

$$e(y) = \int_{0}^{\infty} g(ky)\,dF_d(k) = F_d(r_2) - F_d(r_1).$$

Thus from (7.10) we have the following unbiased estimator of $q$:

$$\Theta_1 = \frac{1}{N}\sum_{i=1}^{N} e(y_i),$$

where $y_1,\dots,y_N$ are independent realizations of the random variable $\eta$. An
estimator with smaller variance can be obtained if we use a set of dependent
vectors, an orthonormalised system of vectors, instead of the independent vectors
$y_i$. The estimator $\Theta_k$ is obtained if the sum of $k$ vectors from this
orthonormalised system of vectors is employed instead of the vector $y_i$.
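The estimator built from independent directions can be sketched in Python for the simplest case $d = 2$ with $R$ equal to the identity matrix. These restrictions are assumptions of this illustration: they make $\eta$ uniform on the unit circle and give the $\chi$ distribution function in closed form, $F_2(r) = 1 - e^{-r^2/2}$. The orthonormalised variants $\Theta_k$ are not implemented here.

```python
import math
import random

def ray_interval(y, a, b):
    """Entry and exit radii (r1, r2) of the ray {r*y, r >= 0} in the box a <= x <= b.

    Returns None if the ray misses the box.
    """
    r1, r2 = 0.0, math.inf
    for yi, ai, bi in zip(y, a, b):
        if abs(yi) < 1e-15:
            if not (ai <= 0.0 <= bi):
                return None
            continue
        lo, hi = sorted((ai / yi, bi / yi))
        r1, r2 = max(r1, lo), min(r2, hi)
    return (r1, r2) if r1 < r2 else None

def chi_cdf_2dof(r):
    """Distribution function of the chi distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-0.5 * r * r)

def rectangle_probability(a, b, n_samples=20000, seed=0):
    """Average of e(y) = F(r2) - F(r1) over independent random directions y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        y = (math.cos(phi), math.sin(phi))
        interval = ray_interval(y, a, b)
        if interval is not None:
            total += chi_cdf_2dof(interval[1]) - chi_cdf_2dof(interval[0])
    return total / n_samples
```

Because the components are independent when $R$ is the identity, the result can be checked against the product of one-dimensional normal probabilities computed with `math.erf`.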

In the paper Deak [7] very fast machine coded random number generators
were used to compute probabilities $p$. Recently we made an attempt to develop
an easy-to-use subroutine system on an IBM 3031 computer. It was completely
FORTRAN coded and only standard, very well-known techniques were used for
random generation (a multiplicative congruential uniform generator in double
precision and the polar method for generating normal samples). Some execution
times are given in the following Table 7.1.


Table 7.1

    Operation             Time
    Empty loop              3 μsec
    Uniform generator      70 μsec
    Polar method          186 μsec
    Square root            56 μsec

We implemented only the estimator $\Theta_2$ for computing distribution function
values $p$. In order to obtain probabilities with error less than 0.01 (i.e. their
standard deviation is less than 0.01) we need less than 0.4 sec (up to 20
dimensions), see Table 7.2. Times necessary to compute $d$-dimensional
probabilities $q$ of rectangles with two accurate digits do not exceed 0.6 sec up to
$d = 20$ dimensions, see Table 7.3. More details can be found in Deak [9] or in
Deak [10].


Table 7.2. Times (sec) necessary to compute $d$-dimensional distribution function
values with two accurate digits (columns: $d$, time).


Table 7.3. Times (sec) necessary to compute $d$-dimensional probabilities of
rectangles with two accurate digits (columns: $d$, time).
7.6 Solution strategies


In order to solve a stochastic programming problem, where multidimensional 
integrations are also involved, one has to experiment with several approaches. In 
this section we propose an order of priority of the different solution techniques. 


First try 


Solve the problem for specific practical cases explicitly; as for example Hansotia 
[27] solved the two-stage stochastic programming problem in case of normal dis- 
tribution, or as Ewbank [17] gave a closed form expression for the distribution 
function of the maximum in a stochastic linear programming problem. 


Second try 


Consider an approximating discrete distribution and solve the resulting system
(Strazicky [48], Kall [30], Kall and Stoyan [31], Wets [56], Wets [57]). One must
note here that sometimes the astronomical number of approximating problems
or the size of the problem renders the solution practically impossible.


Third try 


Experiment with product forms or rules exact with a given degree in low (d < 5) 
dimensions and with quasi-Monte Carlo methods in low and medium dimensions 
(d < 10). 


Fourth try 


Use Monte Carlo methods in dimensions (d > 5) 


Fifth try 


Reduce the variance of the Monte Carlo estimator, developing special techniques 
for the given problem. 


In the following Table 7.4 we summarize our preferences on the usage of
the different multidimensional integration methods. The higher the dimension,
the more random the method applied will be. This is not a coincidence,
since there is very strong evidence for doing so.

The Sarma-Eberlein error estimations indicate that in very high dimensions
Monte Carlo methods become best (see Stroud [49]), the work of Yakowitz
et al. [58] demonstrates that the convergence rate of the nonlinear estimator
decreases with the number of dimensions, and finally in Deak [7] computer
experiments showed the simpler estimator's performance to be better with the
increase of the number of dimensions.
Table 7.4

    Methods of integration (rows) against the number of dimensions
    (column scale: 2, 3, 4, 5, 10, 15, 20):

    Crude Monte Carlo methods
    Monte Carlo methods, simple variance reduction
    Monte Carlo methods, sophisticated variance reduction
    Quasi Monte Carlo
    Nonproduct forms, expansion into series
    Product forms
CHAPTER 8
STOCHASTIC INTEGER PROGRAMMING
A. Rinnooy Kan and L. Stougie


8.1 Introduction. 


This short chapter on stochastic integer programming will be quite different in 
nature from the preceding ones. To a large extent, this difference reflects the 
way in which current research traditions in integer programming differ from 
those in other areas of mathematical programming. 

Initially, integer programming was concerned with a simple and yet fundamental
extension of the generic linear programming model

    minimize    $cx$          (8.1)
    subject to  $Ax = b$      (8.2)
                $x \ge 0$     (8.3)

obtained by adding the constraint

    $x \in Z^{n}$.

Methods to solve this generic integer programming problem were sought in the
hope that their efficiency would match the efficiency of the simplex method for
linear programming. Since virtually every optimization problem encountered in
practice turned out to allow formulation as an integer program, such a method
would be a truly formidable solution tool.

Rapidly, however, it appeared that the great generality of integer programming
comes at a price: neither the cutting plane approach pioneered by
Gomory, the branch-and-bound approach first proposed by Land and Doig, nor
any other method proposed in the sixties turned out to be able to solve any but
the smallest problems within reasonable time. Even today, when linear programming
problems with thousands of variables are solved on a routine basis,
integer programming problems with 80 or 100 variables may already present
insurmountable problems.

For a while, optimists could keep hoping that some totally new approach,
some brilliant fresh idea could provide a breakthrough to a truly efficient integer
programming method. Computational complexity theory, however, put an
end to that illusion in the early seventies, by showing that the computational
difficulties encountered in solving integer programming problems are likely to
be caused by the inherent complexity of the problem and not by the intellectual
limitations of the researchers studying it. More precisely, if we associate the notion
of an easy or well-solved problem with the existence of an algorithm whose
running time increases at most polynomially with problem size, then the general
integer programming problem is highly unlikely to be easy in this sense: it
belongs to a class of notoriously difficult combinatorial optimization problems,
the NP-hard problems, for which strong evidence suggests that any solution
method has superpolynomially increasing running time in the worst case. For
integer programming, an enumerative approach such as branch and bound, in
which the (exponentially large) set of feasible solutions to (8.1), (8.2), (8.3) is
implicitly or explicitly enumerated, provides a good example of such a method.

That, fortunately, is only part of the story. A more encouraging implication
of the complexity results mentioned above is that the road to computational
success for integer programming problems is through the exploitation of special
structure. Methods that solve any integer program are very unlikely to be
efficient, and in this respect the situation is very different from linear
programming. But if specially designed solution methods are used that exploit
the particular features of the model at hand, then the outlook is much
brighter. We notice that for certain important subclasses of integer
programming problems, such as network flow, shortest path and matching
problems, polynomial time algorithms have been designed, implying that these
problems belong to the above mentioned class of well-solved problems. Even if
the problem in question is not easy in the formal sense, it still pays to
investigate whether its special structure allows for sharper bounds, faster
enumeration schemes or tighter cutting planes. In doing so, one may well end up
with an enumerative solution method whose empirical behaviour is completely
satisfactory.

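Much of the above discussion carries over to stochastic integer programming.
From the generic (two-stage) stochastic linear programming problem

$$
\begin{aligned}
\text{minimize}\quad & cx + E\bigl(\min\{\, qy \mid Wy = Tx - p,\ y \ge 0 \,\}\bigr)\\
\text{subject to}\quad & Ax = b\\
& x \ge 0
\end{aligned}
$$

where random variables are boldfaced, it is easy to derive the generic
(two-stage) stochastic integer programming problem:

$$
\begin{aligned}
\text{minimize}\quad & cx + E\bigl(\min\{\, qy \mid Wy = Tx - p,\ y \ge 0,\ y \in \mathbb{Z}^n \,\}\bigr) && (8.4)\\
\text{subject to}\quad & Ax = b && (8.5)\\
& x \ge 0 && (8.6)\\
& x \in \mathbb{Z}^n. && (8.7)
\end{aligned}
$$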


However, since both the general stochastic linear programming problem and, as 
we have seen, the general integer programming problem enjoy a well-deserved 
reputation for computational intractability, so far hardly anybody has been 
tempted to consider methods to solve (8.4), (8.5), (8.6), (8.7) in full general- 
ity. There is no difficulty in principle: one could, for instance, write out the 
equivalent deterministic program as in Chapter , Section , and solve the
resulting large integer programming problem, perhaps by exploiting the special 
structure (though in the case of integer programming it is not so obvious how 
to do that). But the resulting method is not likely to be of great computational 
efficiency. 

Many of the difficulties inherent to the general stochastic integer program- 
ming problem already show up when we consider what theoretical features of 
linear programming contribute to the success of stochastic linear programming 
codes. Take, for example, the pleasant properties of parametric linear program- 
ming that lead to convexity properties for stochastic linear programming. A 
small example will already show how much less well behaved parametric integer 
programs can be. Consider the (deterministic) function 


$$z(x) = 1 - x + \max\{\, y \mid 0 \le y \le x,\ y \in \mathbb{Z} \,\}. \tag{8.8}$$


Its graph is depicted in Figure 8.1, and it shows the peculiar discontinuities and 
nonconvexities that integer programming gives rise to. 
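
As a quick numerical illustration of (8.8), the following Python sketch evaluates the function for x >= 0, where the inner maximum is simply the integer part of x. The function name `z` and the sample points are chosen here for illustration only.

```python
import math


def z(x: float) -> float:
    """Value of (8.8) for x >= 0: 1 - x + max{y integer : 0 <= y <= x}."""
    return 1.0 - x + math.floor(x)


# Discontinuity: just below an integer the value is close to 0,
# at the integer itself it jumps back to 1.
just_below_one = z(1.0 - 1e-9)
at_one = z(1.0)

# Nonconvexity: a convex function never exceeds the average of its values
# at two points symmetric around x; here it does at x = 1.
midpoint_value = z(1.0)
chord_value = (z(0.5) + z(1.5)) / 2.0

print(just_below_one, at_one, midpoint_value, chord_value)
```

The same sawtooth pattern repeats on every unit interval, which is what the graph of the function shows.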


Figure 8.1 


If the integrality constraints appear only at the first stage, then the expected 
optimal second stage costs are still convex in the first stage decision variables 
and the problem can be dealt with by fairly conventional means. The noncon- 
vexities in the two-stage objective function induced by integrality constraints at 
the second stage cause more fundamental problems. Of course, in stochastic in- 
teger programming one usually deals with a weighted sum of ill-behaved second 
stage functions such as (8.8), the smoothing-out effect of which may eliminate 
discontinuity. But convexity or concavity cannot be guaranteed under reason- 
ably general conditions. For instance, let us define 


$$Z(x) = 1 - x + E\bigl[\max\{\, y \mid 0 \le y \le x + \theta,\ y \in \mathbb{Z} \,\}\bigr],$$

where the random variable $\theta$ is uniformly distributed over the interval
$[0, \delta]$ with $\delta < 1$. Simple calculations yield that for
$k = 1, 2, \ldots$

$$
Z(x) = \begin{cases}
k - x, & k - 1 \le x \le k - \delta,\\
(k - x)(1 - 1/\delta) + 1, & k - \delta \le x \le k.
\end{cases}
$$

The graph of $Z$ is depicted in Figure 8.2. Due to the continuity of the
distribution function of $\theta$, $Z$ is a continuous, but still nonconvex
function. General
results on the shape of objective functions of two-stage decision problems are 
derived in Stougie [14]. 
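
The piecewise expression for Z can be checked numerically. The sketch below is only an illustration under the stated model (the random shift uniform on [0, delta], delta < 1, x >= 0): it compares the closed form with a midpoint-rule approximation of the expectation. The names `Z_closed` and `Z_numeric` and the chosen value of delta are assumptions made for this example.

```python
import math


def Z_closed(x: float, delta: float) -> float:
    """Piecewise closed form of Z on the interval k - 1 <= x < k."""
    k = math.floor(x) + 1
    if x <= k - delta:
        return k - x
    return (k - x) * (1.0 - 1.0 / delta) + 1.0


def Z_numeric(x: float, delta: float, steps: int = 20_000) -> float:
    """1 - x + E[floor(x + theta)], theta uniform on [0, delta] (midpoint rule)."""
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * delta / steps
        total += math.floor(x + theta)
    return 1.0 - x + total / steps


delta = 0.4
for x in (0.2, 0.7, 0.9, 1.35, 2.95):
    print(x, round(Z_closed(x, delta), 4), round(Z_numeric(x, delta), 4))
```

With delta = 0.4 the closed form gives 0.7 at x = 0.8, 1.0 at x = 1.0 and 0.8 at x = 1.2, so the value at the midpoint exceeds the average of its neighbours: the function is continuous but not convex.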


Figure 8.2 


So, as in the case of deterministic integer programming, we turn to the ex- 
ploitation of special structure as the last hope for some computational progress. 
Indeed, this is what most of the (limited) research efforts in the area have fo- 
cused on. The above discussion suggests that an appropriate first step should 
be to obtain more insight in the behaviour of the distribution problem solution 
for these specially structured problems, and this has turned out to be an un- 
expectedly fruitful area of research. Natural probabilistic extensions of some 
traditional combinatorial optimization problems turn out to have the surprising 
property that the random variable corresponding to their optimal solution value 
converges in some stochastic sense to a simple analytical function of problem 
parameters when the problem size increases. These results are discussed in more 
detail in Section 8.2, which is devoted to the integer stochastic programming 
distribution problem. 

In Section 8.3, we shall see how results on the distribution problem find 
application in the construction of solution methods for the two-stage decision 
problem. In fact, if the second stage problem is one of those for which an asymp- 
totic closed form for the optimal solution value is known, then it is intuitively 
obvious that a heuristic of good asymptotic properties can be based on using 
the closed form expression in an approximation to the original objective (8.4). 
Results of this nature, together with a brief examination of the possibilities for 
an optimization method in contrast to approximation (heuristic) methods, can
be found in Section 8.3. 

By their very nature, the available results on specially structured stochastic 
integer programming problems are to a large extent ad hoc and hence of limited 
general value. We have not attempted to provide an exhaustive survey of the 
area; for that, we refer to Stougie [14] and to the annotated bibliography Karp 
et al., [7]. In fact, we propose to illustrate the nature of the results obtained on a 
very simple but typical stochastic integer programming problem, that might be 
called the machine investment problem. The first stage of this problem involves 
the acquisition of a certain number of identical machines at cost $c$ each,
subject to probabilistic information about the processing times $p_j$
$(j = 1, \ldots, n)$ of the jobs that will have to be executed on these
machines in the second stage. The objective of the second stage decision is to
minimize the makespan (i.e., the maximum sum of the processing times assigned
to any one machine) of the resulting schedule. If we denote the minimum
makespan value as a function of the number of machines $m$ by $C_n^*(m)$, then
the stochastic program is to minimize

$$Z_n(m) = cm + E\,C_n^*(m) \tag{8.9}$$

where $m$ is constrained to be integer. The computation of $C_n^*(m)$ is itself
an (NP-hard) combinatorial optimization problem. Thus, this simple example
incorporates all the features characterizing the collection of stochastic integer 
programming problems that we shall be addressing here. 


8.2 The distribution problem 


As announced in Section 8.1, probabilistic versions of traditional combinato- 
rial optimization problems sometimes have the remarkable property that their 
optimal solution value is asymptotic to a simple function of certain problem 
parameters. 

The machine investment problem provides a striking example. Recall that the
second stage corresponds to the minimization of makespan on $m$ machines.
Specifically, any feasible schedule must satisfy the restrictions that each
machine processes at most one job at a time and each job is processed during an
uninterrupted interval of length equal to its processing time. For this NP-hard
optimization problem enumerative methods provide the only available solution
tool. We are interested, however, in a probabilistic version of it as it
appears to the first stage decision maker. Let us assume that the processing
times of the jobs are independent identically distributed (i.i.d.) random
variables with expected value $\mu$. Intuition suggests that the minimal
makespan for $n$ sufficiently large will be relatively close to the lower bound
achieved by dividing the total workload $\sum_{j=1}^{n} p_j$ evenly among the
$m$ machines. We will show that this intuition is correct. For the proof we
rely on the above lower bound and on an upper bound provided by a heuristic
solution of the problem. We assume that $E p_j^2 < \infty$. For the formal
analysis we define the following random model of the problem. Let the
processing times of a problem with $n$ jobs be the first $n$ elements of a
random vector drawn from an infinite dimensional sample space $\Omega$.


The heuristic that we use is a simple list scheduling rule: the jobs are
placed in an arbitrary fixed order and at each step the next job on the list
is assigned to the first available machine (see Figure 8.3). Let $C_n^H(m)$
denote the makespan under this heuristic for given $m$ and given a realization
of the processing times. Let $L$ be the latest time that all machines are
occupied and let job $k$ be completed last. By the nature of list scheduling,
$L \le \sum_{j=1}^{n} p_j / m$.

Figure 8.3 Illustration of the list scheduling heuristic.
Problem instance: m = 3, n = 7, p = (1, 2, 4, 3, 5, 6, 7)


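Since job $k$ starts no later than $L$, we have $C_n^H(m) \le L + p_k$.
Trivially, $p_k \le \max_{j=1,\ldots,n} p_j = p_{\max}$. Therefore

$$C_n^H(m) \le \sum_{j=1}^{n} p_j/m + p_{\max}.$$

This inequality combined with the lower bound $\sum_{j=1}^{n} p_j/m$ on the
optimal makespan yields

$$\sum_{j=1}^{n} p_j/m \le C_n^*(m) \le C_n^H(m) \le \sum_{j=1}^{n} p_j/m + p_{\max}.$$

Dividing this by $n\mu/m$ yields

$$
\frac{\sum_{j=1}^{n} p_j}{n\mu} \le \frac{C_n^*(m)}{n\mu/m} \le \frac{C_n^H(m)}{n\mu/m} \le \frac{\sum_{j=1}^{n} p_j}{n\mu} + \frac{m\,p_{\max}}{n\mu}. \tag{8.10}
$$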
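
The list scheduling rule and the two bounds on its makespan are easy to reproduce. The following Python sketch is an illustration, not code from the literature cited here: it schedules the processing times p = (1, 2, 4, 3, 5, 6, 7) of Figure 8.3 on three machines. The function name `list_schedule` and the tie-breaking rule (lowest machine index first) are assumptions of this example.

```python
import heapq


def list_schedule(p, m):
    """Makespan of list scheduling: each job, in list order, goes to the
    machine that becomes available first (ties broken by machine index)."""
    machines = [(0, i) for i in range(m)]  # (time machine becomes free, index)
    heapq.heapify(machines)
    for job in p:
        free_at, i = heapq.heappop(machines)
        heapq.heappush(machines, (free_at + job, i))
    return max(free_at for free_at, _ in machines)


p = [1, 2, 4, 3, 5, 6, 7]
m = 3
makespan = list_schedule(p, m)
lower = sum(p) / m       # workload divided evenly over the machines
upper = lower + max(p)   # even workload plus the longest job
print(lower, makespan, upper)
```

For this instance the heuristic makespan lies strictly between the two bounds; as n grows with m of order at most the square root of n, the gap between the bounds becomes negligible relative to the even workload.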


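Since $\mu$ is finite, the strong law of large numbers implies that

$$\Pr\Bigl\{ \lim_{n\to\infty} \Bigl( \sum_{j=1}^{n} p_j - n\mu \Bigr) \Big/ n\mu = 0 \Bigr\} = 1. \tag{8.11}$$

It remains to prove that

$$\Pr\Bigl\{ \lim_{n\to\infty} m\,p_{\max} / n\mu = 0 \Bigr\} = 1. \tag{8.12}$$

We note that the following lemma is proved in, among others, Feller, [8].

Lemma 8.1. If $E p_j^2 < \infty$, then
(i) $\lim_{n\to\infty} p_{\max}/\sqrt{n} = 0$ almost surely;
(ii) $\lim_{n\to\infty} E\,p_{\max}/\sqrt{n} = 0$. □

Therefore (8.12) holds for all values of $m$ satisfying $m = O(\sqrt{n})$.
(8.10), (8.11) and (8.12) together imply that

$$\Pr\Bigl\{ \lim_{n\to\infty} \frac{C_n^*(m)}{n\mu/m} = 1 \Bigr\} = 1. \tag{8.13}$$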
Taking expectations in (8.10) and applying Lemma 8.1 (ii) implies that

$$\lim_{n\to\infty} \frac{E\,C_n^*(m)}{n\mu/m} = 1.$$

If we assume that the common distribution function of the processing times
has a positive derivative in 0, then it is even possible to prove that

$$\lim_{n\to\infty} \Bigl( C_n^*(m) - \sum_{j=1}^{n} p_j/m \Bigr) = 0$$

almost surely, and

$$\lim_{n\to\infty} \bigl( E\,C_n^*(m) - n\mu/m \bigr) = 0. \tag{8.14}$$


The intuition behind these results is that under the above assumption there are 
enough jobs with very small processing times. These can be used for smoothing 
the differences in the execution times of the machines after having assigned the 
jobs with larger processing times. A rigorous proof of these results is far from 
easy (cf. Frenk and Rinnooy Kan, [5]). Result (8.13) is particularly illuminating,
in that it shows how the optimal value of the second stage objective function (cf.
(8.9)) can be written asymptotically as a simple function of the problem
parameters $n$ and $\mu$, and of the first stage decision variable $m$. Even more pleasantly,
we have seen that there exist simple scheduling heuristics whose solution values 
are also asymptotic to this same function. Similar asymptotic results are avail- 
able for many other combinatorial problems. They can be broadly divided into 
three classes, in accordance with the different underlying probabilistic models. 


(i) Number problems


Here, randomness occurs in certain numerical parameters; typically, these are 
assumed to be i.i.d. random variables. The above scheduling problem provides 
a good example. The linear assignment problem

$$
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_{ij}\\
\text{subject to}\quad & \sum_{i=1}^{n} x_{ij} = 1 \quad (j = 1, \ldots, n)\\
& \sum_{j=1}^{n} x_{ij} = 1 \quad (i = 1, \ldots, n)\\
& x_{ij} \in \{0, 1\}
\end{aligned}
$$

where the weights $a_{ij}$ are i.i.d., is another one. Here, one can show under
quite general conditions on the common distribution function $F$ that the
expected optimal solution value is asymptotic to $nF^{-1}(1/n)$ (Frenk and
Rinnooy Kan, [5]). The proof amounts to showing that, with high probability of
being correct, we may set all $x_{ij}$ equal to zero, except those
corresponding to the smallest weights for each $i$ and each $j$. Thus with
probability approaching 1 we obtain a feasible solution whose value is
asymptotic to $n$ times the value of the smallest order statistic; i.e.,
$nF^{-1}(1/n)$.

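A third example is provided by the knapsack problem

$$
\begin{aligned}
\text{maximize}\quad & \sum_{j=1}^{n} c_j x_j\\
\text{subject to}\quad & \sum_{j=1}^{n} a_j x_j \le bn\\
& x_j \in \{0, 1\}.
\end{aligned}
$$

If the $c_j$ and $a_j$ are independent and uniformly distributed on $[0, 1)$,
then the expected optimal solution value is asymptotic to $n\sqrt{2b/3}$ if
$0 < b \le \tfrac{1}{6}$ and to
$n\bigl(-\tfrac{3}{2}b^2 + \tfrac{3}{2}b + \tfrac{1}{8}\bigr)$ if
$\tfrac{1}{6} < b \le \tfrac{1}{2}$ (if $b > \tfrac{1}{2}$, then asymptotically
all $x_j$ are equal to 1 with probability 1) (Meanti et al., [12]). To derive
this result, one shows that the optimal solution value is asymptotic to the
value of the linear relaxation, which is equal to
$n \min_{\lambda} \{ L_n(\lambda) \mid \lambda \ge 0 \}$ where

$$
\begin{aligned}
L_n(\lambda) &= \max\Bigl\{ \lambda b + \frac{1}{n} \sum_{j=1}^{n} (c_j - \lambda a_j) x_j \Bigm| 0 \le x_j \le 1\ (j = 1, \ldots, n) \Bigr\}\\
&= \lambda b + \frac{1}{n} \sum_{j=1}^{n} (c_j - \lambda a_j)\, x_j(\lambda)
\end{aligned}
$$

with

$$
x_j(\lambda) = \begin{cases}
1 & \text{if } c_j - \lambda a_j \ge 0,\\
0 & \text{otherwise.}
\end{cases}
$$

The strong law of large numbers implies that $L_n(\lambda)$ converges almost
surely to

$$L(\lambda) = \lambda b + E\,c_1 x_1(\lambda) - \lambda E\,a_1 x_1(\lambda)$$

and results from convex analysis can be used to show that
$\min_{\lambda} \{ L_n(\lambda) \mid \lambda \ge 0 \}$ converges (almost surely
and in expectation) to the unique minimum of $L(\lambda)$. Elementary
computations then yield a closed form expression for $L(\lambda)$ and through
that the above result. As in most cases, this result is accompanied by (and,
indeed, derived through) a simple heuristic whose error disappears
asymptotically.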
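
The asymptotic value of the knapsack problem can be compared with a simulated linear relaxation. The sketch below is illustrative only: it assumes c_j and a_j independent and uniform on [0, 1), computes the relaxation by the familiar greedy filling in order of decreasing ratio c_j / a_j, and divides by n. The function names and the sample size are choices made for this example.

```python
import math
import random


def knapsack_limit(b: float) -> float:
    """Limit of (optimal value) / n for capacity b * n and uniform c_j, a_j."""
    if b <= 0:
        raise ValueError("b must be positive")
    if b <= 1 / 6:
        return math.sqrt(2 * b / 3)
    if b <= 1 / 2:
        return -1.5 * b * b + 1.5 * b + 0.125
    return 0.5  # all items fit asymptotically, and E c_j = 1/2


def lp_relaxation_value(c, a, capacity):
    """Optimal value of the linear relaxation: fill by decreasing c_j / a_j."""
    order = sorted(
        range(len(c)),
        key=lambda j: c[j] / a[j] if a[j] > 0 else math.inf,
        reverse=True,
    )
    value, room = 0.0, capacity
    for j in order:
        if a[j] <= room:
            value += c[j]
            room -= a[j]
        else:
            value += c[j] * room / a[j]  # fractional part of the critical item
            break
    return value


def simulate(n: int, b: float, seed: int = 0) -> float:
    """Relaxation value divided by n for one random instance."""
    rng = random.Random(seed)
    c = [rng.random() for _ in range(n)]
    a = [rng.random() for _ in range(n)]
    return lp_relaxation_value(c, a, b * n) / n


for b in (0.1, 0.3):
    print(b, round(simulate(20_000, b), 4), round(knapsack_limit(b), 4))
```

Dropping the single fractional item turns the relaxed solution into a feasible 0-1 solution and costs at most one unit of objective value, which is negligible next to a value of order n.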


(ii) Euclidean problems


These problems can be formulated with respect to n points in the Euclidean 
plane; their probabilistic version then amounts to assuming these points to be 
distributed uniformly over (say) the unit square. The most famous example is 
the traveling salesman problem of finding the shortest tour through the points.
The optimal solution value is asymptotic to $\beta\sqrt{n}$ with probability 1
(Steele, [13]), where in the case of the unit square $\beta \approx 0.765$. The
proof of this result is extremely complicated, although it is not hard to
appreciate the proportionality to $\sqrt{n}$ intuitively: for large $n$, the
optimal tour through $4n$ points in a $2 \times 2$ square is 4 times as large
as the optimal tour through $n$ points in a $1 \times 1$ square, and scaling
down the $2 \times 2$ square to a $1 \times 1$ one halves the length. Here, a
simple space partitioning heuristic (Karp, [7]) achieves a solution value that
is asymptotic to the optimal one. In one version, the unit square is
partitioned into $s(n)$ equal size subsquares $Q_i$. The optimal local tours
$T^*(Q_i)$ are computed and $s(n)$ points, one selected from each $Q_i$, are
linked by a single global tour. This yields a Euclidean walk through all points
which can be easily transformed into a tour of no greater length. For the
analysis, one shows that $\sum_i T^*(Q_i)$ exceeds the optimal tour length by
no more than 3 times the sum of the perimeters of the $Q_i$, which is an
$O(\sqrt{s(n)})$ term. There are various ways to construct the global tour, so
that its length adds no more than $O(\sqrt{s(n)})$ to the absolute error again.
Then, by taking $s(n) = o(n)$ and invoking the above result, one sees that the
relative error converges to 0 almost surely. Similar results, combined with
similar heuristics, have been obtained for Euclidean location problems (cf.
Zemel, [15]) and routing problems (cf. Marchetti Spaccamela et al., [10] and
Haimovich and Rinnooy Kan, [6]). For an overview, see Karp et al., [7].

(iii) Graph problems

Two natural models for random graphs, one in which each edge is present
with probability $p$, and one in which $m$ edges are scattered uniformly among
the vertex pairs, provide the context for probabilistic versions of combinatorial
optimization problems defined on graphs. The maximum clique problem of
finding the size of the largest complete subgraph is a particularly fine example:
under the first probabilistic model, the maximum clique size is asymptotic to
$2 \ln n / \ln(1/p)$ (Matula, [11]). For results on other graph problems we refer
once again to Karp et al., [7]. 


In all the above cases (with the exception of the linear assignment prob- 
lem), it would be virtually impossible to solve every large instance of these 
optimization problems to optimality. Thus, if one wishes to have an exact so- 
lution to the distribution problem, this can only be achieved for small problem 
sizes. Given the parametric character of the necessary computations, dynamic 
programming is a natural tool to consider; as we shall see in the next section, 
it can occasionally be applied with reasonable success. 

The asymptotic character of all the above results is one of their least attrac- 
tive features. For some of them (in particular the number and graph problems), 
speed of convergence results provide additional information about the rate at 
which the objective function value converges to its limit. Especially for the 
Euclidean problems, however, such results are notoriously lacking. 


8.3 Multi-stage decision problems


For the solution of integer multi-stage decision problems, the difficulty of which 
has been discussed extensively in Section 8.1, heuristics can be designed in 
which results of the type presented in Section 8.2 play a central role. When an 
asymptotic characterization of the optimal value of the second stage problem 
has been derived as in (8.13), it can be used as part of an estimation of the 
overall two-stage objective function. As mentioned in Section 8.2, this estimate 
is frequently a simple function of the first stage decision variables. Its mini- 
mization yields the heuristic first stage decision. We then require a heuristic 
for solving the second stage problem, which is usually an NP-hard one. One
would like this heuristic to provide solutions of such quality that strong asymp- 
totic optimality properties of the whole heuristic procedure are guaranteed. 
Fortunately, simple heuristics frequently turn out to be good enough for these 
purposes. 

The combination of a good estimate of the total cost and a good approximation
procedure for the solution of the second stage problem can be shown rigorously
to yield a guarantee for the asymptotic optimality of the resulting stochastic
integer programming heuristic. More specifically, the relative error of
the heuristic, obtained by dividing the difference between the heuristic value 
and the optimal value of the problem by the optimal value, can be shown to 
converge stochastically to 0 with increasing problem size for a very general class 
of models. 

We illustrate the above ideas again with the example of the machine
investment problem. The asymptotic characterization (8.14) of the optimal value
of the second stage scheduling problem allows us to estimate the overall cost of
the two-stage decision problem by the function

$$Z_n^{LB}(m) = cm + \frac{n\mu}{m}.$$

Minimization with respect to $m$ of this unimodal convex function, subject to
the restriction that $m$ is integral, produces a heuristic first stage decision
$m_n^{H1}$ equal to $\lfloor\sqrt{n\mu/c}\rfloor$ or to
$\lceil\sqrt{n\mu/c}\rceil$, depending on which of these two values is more
favorable. For the solution of the second stage scheduling problem, we have
seen in Section 8.2 that the list scheduling rule yields a relative error that
tends to 0 almost surely if $m = O(\sqrt{n})$. We note that
$m_n^{H1} \approx \sqrt{n\mu/c}$. Therefore, if $C_n^{H2}(m)$ denotes the
makespan produced by the heuristic, the above makes it easy to verify that

$$\lim_{n\to\infty} \frac{c\,m_n^{H1} + E\,C_n^{H2}(m_n^{H1})}{\min_m Z_n(m)} = 1,$$

which establishes asymptotic optimality of the heuristic procedure as a whole.
A detailed description of the above result is given in Dempster et al., [2].

We can also compare the heuristic solution value with the optimal solution
value of the machine investment problem under the assumption that all
information is available in advance. This problem can be formulated as finding
a function $m_n^0 : \Omega \to \mathbb{N}$ that gives for each realization of
the random processing times a value $m_n^0$ for which

$$c\,m_n^0 + C_n^*(m_n^0) = \min_m \{\, cm + C_n^*(m) \,\}.$$

The reasoning used for the justification of result (8.13) makes it easy to
verify that almost surely

$$\lim_{n\to\infty} \frac{c\,m_n^{H1} + C_n^{H2}(m_n^{H1})}{c\,m_n^0 + C_n^*(m_n^0)} = 1.$$


This result implies that the relative error that can be attributed to imperfect in- 
formation also tends to zero almost surely. This strong property of the heuristic 
was named asymptotic clairvoyance in Lenstra et al., [9]. 
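
The two-step heuristic for the machine investment problem described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Dempster et al.: it assumes the mean processing time is known, rounds the square root of n times the mean over c down and up, keeps the cheaper candidate according to the estimate cm plus n times the mean over m, and then schedules a given realization by list scheduling. All function names are hypothetical.

```python
import heapq
import math


def first_stage_decision(n: int, mu: float, c: float) -> int:
    """Integer m minimizing the estimate c*m + n*mu/m (floor or ceiling of the root)."""
    root = math.sqrt(n * mu / c)
    candidates = sorted({max(1, math.floor(root)), max(1, math.ceil(root))})
    return min(candidates, key=lambda m: c * m + n * mu / m)


def list_schedule(p, m):
    """Makespan when each job in list order goes to the first available machine."""
    machines = [(0, i) for i in range(m)]
    heapq.heapify(machines)
    for job in p:
        free_at, i = heapq.heappop(machines)
        heapq.heappush(machines, (free_at + job, i))
    return max(free_at for free_at, _ in machines)


def heuristic_cost(p, mu: float, c: float):
    """First stage decision and realized total cost for processing times p."""
    m = first_stage_decision(len(p), mu, c)
    return m, c * m + list_schedule(p, m)


# Processing times of Figure 8.3, with mean 4 and unit machine cost (assumed values).
print(heuristic_cost([1, 2, 4, 3, 5, 6, 7], mu=4.0, c=1.0))
```

Only the mean of the processing time distribution enters the first stage decision; the realized processing times are needed only when the schedule is built.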

In a similar way, heuristics of equal quality can be constructed for other 
two-stage decision problems of which the second stage problem allows asymp- 
totic characterization of the optimal solution, such as vehicle routing problems 
(Marchetti Spaccamela et al., [10]) and location problems (Stougie, [14]) that 
are preceded by an investment decision. For instance, in the vehicle routing 
case, the objective is to minimize the sum of the cost of acquiring m vehicles 
at cost c each and the expected length of the longest route subsequently taken 
by any of the m vehicles to serve n customers from a common depot. By un- 
derestimating the latter by the expected cost of the shortest traveling salesman 
tour through all $n$ customers divided by $m$ (i.e., $\beta\sqrt{n}/m$), we again arrive at
an asymptotically optimal heuristic. In the location problems the objective is
to minimize the sum of the cost of establishing $m$ depots and the expected sum of the
distances from each of $n$ customers to the nearest depot. A general framework
for the design and analysis of such stochastic integer programming heuristics is 
presented in Lenstra et al., [9]. 

The remaining part of this section will be dedicated to optimization methods
for stochastic integer programming. Only a few results are available in this
direction. Such methods have been designed for some two-stage decision
problems, of which the stochastic parameters are assumed to have discrete
distributions with only a small number of points with positive probability. It
is not surprising that the parametric relations between the various feasible
solutions of these problems can efficiently be exploited by dynamic programming
routines. This can again be illustrated through the machine investment problem.
Let us assume that the processing times of the jobs can have only $k$ possible
values, $a_1, \ldots, a_k$. Let $C^*(m, n_1, \ldots, n_k)$ be the minimum
makespan of a set of jobs consisting of $n_i$ jobs with processing times $a_i$
$(i = 1, \ldots, k)$. Any schedule, and therefore also the optimal one, can be
split into a schedule of a subset of the jobs on a subset of the machines and
the rest of the jobs on the rest of the machines. Based on this observation we
derive the following recurrence relations:

$$C^*(1, n_1, \ldots, n_k) = \sum_{i=1}^{k} n_i a_i$$

and, for every $m'$ satisfying $1 \le m' < m$,

$$
C^*(m, n_1, \ldots, n_k) = \min_{\substack{0 \le \ell_i \le n_i\\ i = 1, \ldots, k}} \max\bigl\{ C^*(m', \ell_1, \ldots, \ell_k),\ C^*(m - m', n_1 - \ell_1, \ldots, n_k - \ell_k) \bigr\}. \tag{8.15}
$$

We can evaluate the objective function of the two-stage decision problem
for each interesting value of $m$ by computing the minimum makespan for each
possible composition of the set of $n$ jobs, weighing them with the
corresponding probabilities, taking the weighted sum and adding $cm$. Solving
the machine investment problem is now just a matter of selecting the minimum
value.

This algorithm has a running time that, interestingly enough, is polynomial
in the number $n$ of jobs. It is, however, exponential in the number $k$ of
possible values of the processing times. This can be seen by setting $m'$ in
(8.15) equal to 1 and evaluating $C^*(m, n_1, \ldots, n_k)$ for all values of
$m$ ranging from 1 to $n$, and for all values of $n_1, \ldots, n_k$ satisfying
$\sum_{i=1}^{k} n_i \le n$. These are $O(n^{k+1})$ evaluations, each of which
requires the solution of equality (8.15), which can be achieved by the
comparison of $O(n^k)$ values. Hence the overall running time is
$O(n^{2k+1})$. Obviously better versions and implementations are possible, and,
in fact, the best running time bound that has been achieved is
$O(n^{2k-1} \log n)$ (Lageweg et al., [8], elsewhere in this book). Therefore
already for small values of $k$, only problems with a limited number of jobs
can be solved.
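
The recurrence (8.15) with m' = 1 translates directly into a memoized recursion. The Python sketch below is a plain illustration of that recurrence and of the weighted sum over job compositions; it is not the tested routine of Lageweg et al., and the function names, the multinomial weighting of compositions and the tiny test instance are assumptions made here.

```python
from functools import lru_cache
from itertools import product
from math import factorial, prod


def min_makespan(m, counts, values):
    """C*(m, n_1, ..., n_k): counts[i] jobs of length values[i] on m machines."""
    values = tuple(values)

    @lru_cache(maxsize=None)
    def C(machines, n):
        if machines == 1:
            return sum(ni * ai for ni, ai in zip(n, values))
        best = None
        # Recurrence (8.15) with m' = 1: choose the jobs l for one machine,
        # schedule the remaining jobs optimally on the other machines.
        for l in product(*(range(ni + 1) for ni in n)):
            rest = tuple(ni - li for ni, li in zip(n, l))
            candidate = max(C(1, l), C(machines - 1, rest))
            if best is None or candidate < best:
                best = candidate
        return best

    return C(m, tuple(counts))


def expected_cost(m, n, values, probs, c):
    """c*m plus the expected minimum makespan of n i.i.d. discrete jobs."""
    k = len(values)
    total = 0.0
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) != n:
            continue
        ways = factorial(n)
        for ni in counts:
            ways //= factorial(ni)  # multinomial coefficient
        weight = ways * prod(pi ** ni for pi, ni in zip(probs, counts))
        total += weight * min_makespan(m, counts, values)
    return c * m + total


# Two jobs, each of length 1 or 2 with probability 1/2, machine cost 1.
costs = {m: expected_cost(m, 2, (1, 2), (0.5, 0.5), 1.0) for m in (1, 2)}
print(costs, min(costs, key=costs.get))
```

The inner loop over all vectors l is what makes each evaluation cost of the order of n to the power k, in line with the running time estimate above.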

In Lageweg et al., [8] the above dynamic programming routine is described
in detail and tested. In the same paper similar routines are presented for a
capital budgeting problem and for a hierarchical bin packing problem, for which
at the first stage one has to decide upon the capacity of bins, in which items
have to be packed in the second stage, such that a minimum number of bins is
required. The computational results not only showed
that the above dynamic programming routines do work satisfactorily but also 
yielded some insights into the shape of objective functions of integer two-stage 
decision problems involving discrete distributions. These confirmed earlier theo- 
retical insights that were based on parametric analyses of deterministic integer 
programming problems (cf. Blair et al., [1]). To seek alternatives to these 
simple dynamic programming routines is but one of the many challenges that 
remain in the area of stochastic integer programming, an area which is only 
now starting to receive the attention that it deserves. 


Implementation


CHAPTER 9 


A PROPOSED STANDARD INPUT FORMAT FOR 
COMPUTER CODES WHICH SOLVE STOCHASTIC 
PROGRAMS WITH RECOURSE 


J. Edwards 


Abstract 


We explain our suggestions for standardizing input formats for computer codes 
which solve stochastic programs with recourse. The main reason to set some 
conventions is to allow programs implementing different methods of solution 
to be used interchangeably. The general philosophy behind our design is a) to 
remain fairly faithful to the de facto standard for the statement of LP prob- 
lems established by IBM for use with MPSX and subsequently adopted by the 
authors of MINOS, b) to provide sufficient flexibility so that a variety of prob- 
lems may be expressed in the standard format, c) to allow problems originally 
formulated as deterministic LP to be converted to stochastic problems with a 
minimum of effort, d) to permit new options to be added as the need arises. 


9.1 Introduction 


In the latter half of 1984, the Adaptive Optimization project of the Systems 
and Decision Sciences program at the International Institute for Applied Sys- 
tems Analysis collected a number of computer programs written to solve various 
problems in stochastic programming. Our goal was to organize these codes so 
that they might be distributed on magnetic tape to researchers, who might 
benefit from having several algorithms with which to experiment. However, we 
came to realize that the process of tinkering with the various methods would be
greatly complicated because each program has its own format for input data. 
We therefore developed a standard input format for stochastic programs with 
recourse. To encourage and simplify its use, we based it on the input format 
developed by IBM for the extended Mathematical Programming Subsystem 
(MPSX) [1] and adopted by the authors of the Modular In-core Nonlinear Op- 
timization System (MINOS) [3] and we wrote a number of low level subroutines 
to read files written in the standard format. 


9.2 The Problem


The general form of the stochastic program with recourse is taken to be

$$
\begin{aligned}
\text{minimize}\quad & cx + Q(x)\\
\text{subject to}\quad & Ax \ \{\ge,\ =,\ \le\}\ b\\
& l \le x \le u
\end{aligned}
\tag{1}
$$

where

$$Q(x) = E_\omega \bigl\{ \min\, q(y, \omega) \bigm| T(\omega)x + W(\omega)y = p(\omega) \bigr\},$$

$x$ and $y$ denote the decision and recourse variables, respectively, $\omega$
denotes an event, $T(\omega)$ and $W(\omega)$ denote the technology and recourse
matrices, respectively, and $E_\omega$ denotes expectation. In subsequent
references to the technology matrix, the recourse matrix, the stochastic right
hand side, $p(\omega)$, and the penalty function, $q(y, \omega)$, we omit the
arguments $y$ and $\omega$.


9.3 Organization of the Data: Control, Core, and Stochastics Files


The data required by a program written to solve the stochastic program in 
(1) can be divided logically into three files: a control file, a “core” file, and a 
“stochastics” file. Roughly speaking, the control file contains any data partic- 
ular to the program and the core and stochastics files contain the data that 
define the problem. 

As its name implies, the control file contains any information that is used 
to guide the execution of the program. For example, the control file might in- 
clude a limit on the number of steps permitted and a tolerance for convergence 
if the algorithm implemented in the program were iterative in nature, file name 
and unit number assignments if the program required several files, or upper 
limits on the amount of storage needed if the program allocated array space 
“dynamically”. The control file also contains any information that must be read 
before the program profitably can read the contents of the matrices and vectors 
that appear in the problem, e.g., the dimensions of those structures. Because 
the contents of the control file depend heavily on the algorithm employed and 
the manner in which it is implemented, we have not included a standard format 
for control files. Indeed, the rigid structure of the format we propose (partic- 
ularly its strict use of specific columns as field delimiters) makes it unsuitable 
for application to files whose contents are liable to change frequently. 

The core file contains the bounds on the decision vector, $x$, the contents of
the matrices by which it is multiplied, and the contents and ranges of the rows
of the deterministic right hand side vector, $b$. The core file for a stochastic
LP thus corresponds in large measure to the data file that MPSX or MINOS would
require to solve the equivalent nonstochastic LP (i.e., the same problem with
$Q(x)$ removed).

The stochastics file defines the technology matrix, $T$, the distribution of
the rows of the stochastic right hand side vector, $p$, the contents of the
recourse matrix, $W$, and the function $q$. We have chosen to partition the
input in this fashion so that a problem originally formulated as a linear
program and expressed in standard MPSX format may be augmented later by a
stochastics file, thereby permitting certain elements (e.g., the right hand
side) to be stochastic.


9.4 Overview of the Standard Input Format 


The proposed format is quite similar to the MPSX format, which is described 
on pages 199 through 209 of [1], although there are some differences. As in the 
MPSX format, each data file contains a number of sections, some of which are 
optional. A “header line” (or “header”) marks the beginning of each section*.
Most sections contain data lines. A data line is divided into six fields, some 
of which may be empty. Specific columns delineate field boundaries. There 
are three name fields, two numeric fields, and a code field. The columns that 
constitute these fields are 


— columns 2 and 3: code field 

— columns 5 through 12: first name field 

— columns 15 through 22: second name field 

— columns 25 through 36: first numeric field

— columns 40 through 47: third name field

— columns 50 through 61: second numeric field 


(all column ranges are inclusive). Comment lines contain an asterisk (*) in the
first column and may appear anywhere. 
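
The column layout above can be read with a few string slices. The following is a minimal sketch, not part of the proposed standard: the names `DataLine`, `FIELDS` and `parse_data_line` are our own, header lines are assumed to be recognized by the caller before this function is used, and trailing blanks in a name are assumed to be padding (leading blanks are kept, since the format permits them).

```python
from typing import NamedTuple, Optional


class DataLine(NamedTuple):
    code: str
    name1: str
    name2: str
    num1: Optional[float]
    name3: str
    num2: Optional[float]


# 1-based, inclusive column ranges taken from the field list above.
FIELDS = {
    "code": (2, 3),
    "name1": (5, 12),
    "name2": (15, 22),
    "num1": (25, 36),
    "name3": (40, 47),
    "num2": (50, 61),
}


def _cut(line: str, first: int, last: int) -> str:
    """Return columns first..last, padding short lines with blanks."""
    return line.ljust(last)[first - 1:last]


def _number(text: str) -> Optional[float]:
    """Convert a numeric field; empty fields give None."""
    text = text.strip()
    if not text:
        return None
    if "." not in text:
        raise ValueError(f"numeric field {text!r} lacks a decimal point")
    return float(text)


def parse_data_line(line: str) -> Optional[DataLine]:
    """Parse one data line; return None for a comment line."""
    line = line.rstrip("\n")
    if line.startswith("*"):
        return None
    raw = {key: _cut(line, a, b) for key, (a, b) in FIELDS.items()}
    return DataLine(
        code=raw["code"].strip().upper(),
        # Names are folded to upper case; only trailing padding is removed.
        name1=raw["name1"].rstrip().upper(),
        name2=raw["name2"].rstrip().upper(),
        num1=_number(raw["num1"]),
        name3=raw["name3"].rstrip().upper(),
        num2=_number(raw["num2"]),
    )
```

A line whose numeric field lacks a decimal point is rejected here with a `ValueError`; a real input utility might prefer to report the line number as well.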

Unlike the MPSX format, names may contain embedded blanks or leading
blanks (although leading blanks are not recommended). The contents of the name
fields are interpreted as character strings, so names may begin with a digit. All 
lower case letters in the code and name fields are translated to their upper case 
equivalents. Values in the numeric fields must contain a decimal point. The 
MPSX convention concerning comments following a dollar sign ($) in the first 
column of the second or third name fields has not been adopted as part of the 
standard format. 

Following are descriptions of each of the data files. Each description con- 
tains a list of the sections that constitute the corresponding data file. These 
sections must appear in the data file in the same order as they appear in the 
list, although sections marked “optional” need not appear at all. 


* ([1], p. 199) uses the term “indicator card” rather than header.




9.5 The Core File 
The core file specifies 


— the linear portion of the objective, c,

— the contents of the constraint matrix, A, and possibly the contents of the
technology matrix, T, and of the recourse matrix, W,

— the deterministic right hand side, b,

— the bounds on the decision vector, x, and

— the ranges on the right hand side.


The core file contains the following sections: NAME, ROWS, COLUMNS, RHS, 
RANGES, BOUNDS, and ENDATA. These sections assume more or less the 
same role in the standard format as they do in the MPSX format. Therefore, 
we give only an abbreviated description of these sections and note differences 
between the standard format and the MPSX format. 


(1) NAME - This is an informative header line (the section contains no data 
lines). The user may enter any characters desired in columns 15 through 
72 (the MPSX format restricts names to eight alphanumeric characters). 

(2) ROWS - As in the MPSX format, this section specifies the names of the 
rows of A, the name of the row in the COLUMNS section that contains the 
elements of c, and the type of constraint (equality or inequality) represented 
by each row. In some cases, this section also specifies the names of the rows 
of T. Rows formed by a linear combination of two other rows (type “D” 
rows) and scaling of rows (use of the “SCALE” keyword) are supported
in the MPSX format but are not permitted in the standard format. 


(3) COLUMNS - As in the MPSX format, this section specifies the names of 
the columns of A and of c and contains the values of the nonzero elements 
of A and of c. In some cases, this section also specifies the names of the 
columns of W, contains the nonzero elements of W, and/or contains the 
nonzero elements of T. Scaling of columns (use of the “SCALE” keyword)
is supported in the MPSX format but is not permitted in the standard 
format. 

(4) RHS - This section specifies the names of the rows of b and contains
the values of the nonzero elements of b. This section is identical to its
counterpart in the MPSX format. 

(5) RANGES (optional) - This section specifies the ranges on the rows of b. 
This section is identical to its counterpart in the MPSX format. 


(6) BOUNDS (optional) - This section specifies the bounds on the rows of
the decision vector, x. This section is identical to its counterpart in the
MPSX format.


(7) ENDATA - This line marks the end of the core file (the section contains 
no data lines) and is identical to its counterpart in the MPSX format. 


9.6 The Stochastics File
The stochastics file specifies 


— the contents of the technology matrix, T, 

— the distribution of the stochastic right hand side, p, 
— the contents of the recourse matrix, W, and 

— the form of the penalty function, q.


The stochastics file contains the following sections: NAME, TECHNOLOGY, 
DISTRIBUTIONS, RECOURSE, OBJECTIVES, and ENDATA. After the OB- 
JECTIVES section additional sections may appear containing data particular 
to a given algorithm. A program should read only those sections it needs from 
the file and should ignore the rest. 

Most sections may take one of several forms, and the user must enter the 
name of one of them beginning in column 15 of the header line. A description 
of each of the sections, the forms they may assume, and their contents follows. 


(1) NAME - This is an informative header line (the section contains no data 
lines). The user may enter any characters desired in columns 15 through 
72. 


(2) TECHNOLOGY - This section specifies the contents of T. The section 
may take one of the forms whose names follow: 


DETERMINISTIC (the elements of T follow) - The technology matrix is given 
by the data following the section header. The format of the data is identical 
to that of the COLUMNS section of the core file, i.e., the contents of the 
matrix are specified in column order. The first name field on a line (columns 
5 through 12) contains the name of the column. The remaining name/numeric 
field pairs (columns 15 through 22/25 through 36 and 40 through 47/50 through 
61) specify a row name and the contents of the matrix at the position given by 
the row and column names. The row names form a subset of the row names in 
the ROWS section of the core file. 


CORE (the elements of T appear in the core file) - The data consists of a list 
of names which form a subset of the names specified in the ROWS section of 
the core file. The contents of these rows (as specified in the COLUMNS section 
of the core file) constitute the technology matrix. One name appears per line, 
in the first name field (columns 5 through 12). 


STOCHASTIC (the elements of T are supplied by a subroutine) - The data 
consists of a list of the names of the rows of the technology matrix. Each 
row name has associated with it one or more column names. The column 
names specify the active columns within the given row and form a subset of the
column names specified in the COLUMNS section of the core file. The values 
for the technology matrix do not appear in either data file but are supplied by 
a subroutine written by the user. The row names appear in the first name field 
of a line (columns 5 through 12) and the other two name fields (columns 15 
through 22 and 40 through 47) are available for the column names. 


NONE (no data) - There is no data. The user must decide where and how to
obtain the necessary values. 


(3) DISTRIBUTIONS - This section specifies the distribution of the rows 
of p. The section may take one of the forms whose names follow: 


DISCRETE (each row is independently distributed) - Each row of p may take 
one of a fixed number of values. The data for this form consists of a number 
of “definitions”, which are analogous to the “vectors” in the RANGES and 
BOUNDS sections of the core file (see [1]). Each definition specifies the distri- 
bution of every row of p and consists of a number of sets of entries of the form 
“defname rowname value probability”. Within a given definition, there is one 
such set for each of the rows named in the TECHNOLOGY section. “defname” 
is the name of the definition to which the entry belongs; it occupies the first 
name field on a line (columns 5 through 12). “rowname” is the name of the row 
associated with the entry; it occupies the second name field on a line (columns 
15 through 22). “value” and “probability” are a value for the row and its like- 
lihood, respectively. They occupy the first and second numeric fields (columns 
25 through 36 and 50 through 61), respectively. 


The sum of the probabilities for a given row must be unity. The values specified 
for a given row must be distinct. Entries for different rows or different definitions 
must not be mixed together in the input file. 
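
The three rules just stated (unit probability mass per row, distinct values per row, no interleaving of rows) are easy to check mechanically. The sketch below is only an illustration: the function name, the tuple layout and the tolerance `1e-9` are our own assumptions, and the entries are taken to belong to a single definition, in file order.

```python
import math
from itertools import groupby


def check_discrete_definition(entries, tol=1e-9):
    """Check one DISCRETE definition.

    entries: list of (rowname, value, probability) tuples in file order.
    Raises ValueError on the first rule violation, otherwise returns True.
    """
    seen_rows = set()
    # groupby yields one group per run of consecutive equal row names.
    for row, group in groupby(entries, key=lambda entry: entry[0]):
        if row in seen_rows:
            raise ValueError(f"entries for row {row!r} are mixed with other rows")
        seen_rows.add(row)
        group = list(group)
        values = [value for _, value, _ in group]
        if len(set(values)) != len(values):
            raise ValueError(f"row {row!r} repeats a value")
        total = math.fsum(prob for _, _, prob in group)
        if abs(total - 1.0) > tol:
            raise ValueError(f"probabilities for row {row!r} sum to {total}")
    return True
```

For instance, a hypothetical row `ROWA` with entries `(8.0, 0.6)`, `(9.0, 0.3)`, `(0.0, 0.1)` passes, while splitting those entries around the entries of another row does not.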


As an example, let the T matrix have two rows, TROW1 and TROW2, and 
define two distributions for the rows of p as follows: 


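$$
\text{Row 1} = \begin{cases} 1 & \text{with probability } 0.4 \\ \text{[illegible]} & \text{with probability [illegible]} \\ 4 & \text{with probability } 0.2 \end{cases}
\qquad
\text{Row 2} = \begin{cases} 8 & \text{with probability } 0.6 \\ 9 & \text{with probability } 0.3 \\ 0 & \text{with probability } 0.1 \end{cases}
\tag{1}
$$

and

$$
\text{Row 1} = \text{[illegible]}, \qquad \text{Row 2} = 2 \ \text{with probability } 1.0. \tag{2}
$$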


The contents of the name and numeric fields for these distributions are shown 
in Table 9.1. The user specifies which is the desired definition (our definition 
names “DIST1” and “DIST2” were chosen arbitrarily) when the appropriate 
input utility is called. Note that every value contains a decimal point. 


Table 9.1 Contents of a sample DISCRETE DISTRIBUTIONS section.

[Table body largely lost in extraction. Surviving fragments: the row names TROW1 and TROW2, the values 9.0 and 2.0, and one entry for TROW2 with value 2.0.]


SIMULATION (the rows are supplied by a subroutine) - There are no data lines
in this case. The program obtains its values from a subroutine written by the 
user. 


PIECEWISE (piecewise constant pdf) - Each row of p takes a value within 
one of a finite number of ranges. Within a range, all values are equally likely. 
However, within a set of ranges, all ranges are not equally likely. The data 
for this form consists of a number of “definitions”, which are analogous to the 
“vectors” in the RANGES and BOUNDS sections of the core file (see [1]). Each
definition specifies the distribution of every row of p and consists of a number 
of sets of entries of three lines each. Within a given definition, there is one such 
set for each of the rows named in the TECHNOLOGY section. Each three line 
entry within a set describes a range for the row associated with the set. The 
first line in an entry contains the letters “PC” in the code field (columns 2 and 
3), the name of the definition to which the entry belongs in the first name field 
(columns 5 through 12), the name of the row with which this range is associated 
in the second name field (columns 15 through 22), and the probability that the 
row takes a value within the range in the first numeric field (columns 25 through 
36). The second and third lines in an entry specify the upper and lower bounds 
of the range. For both bounds, the code field contains the letters “BD”, the 
first name field contains the name of the definition to which the entry belongs, 
the second name field contains the name of the row with which the range is 
associated, and the first numeric field contains the bound value. 


The sum of the probabilities for the ranges for a given row must be unity. 
Entries for different rows, different ranges, or different definitions must not be 
mixed together in the input file. 
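
To make the meaning of a piecewise constant density concrete, the following sketch draws a value for one row: a range is chosen according to its probability, and the value is then uniform within that range. The function name and the `(probability, lower, upper)` tuple layout are our own assumptions; whether a range end is open or closed is ignored, since a single point carries no probability.

```python
import random


def sample_piecewise(ranges, rng=random):
    """Draw one value from a piecewise constant density.

    ranges: list of (probability, lower, upper); probabilities sum to one.
    rng: the random module or a random.Random instance.
    """
    u = rng.random()
    cumulative = 0.0
    for prob, lower, upper in ranges:
        cumulative += prob
        if u < cumulative:
            return rng.uniform(lower, upper)
    # Round-off guard: fall back to the last range if u was not matched.
    _, lower, upper = ranges[-1]
    return rng.uniform(lower, upper)
```

With the ranges (5, 7], (1, 3] and (0, 1] and probabilities 0.7, 0.1 and 0.2, roughly seven draws in ten fall between 5 and 7, and none fall between 3 and 5.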


As an example, let the T matrix have two rows, TROW1 and TROW2, and
define two distributions for the rows of p as follows:

$$
\text{Row 1 in } \begin{cases} (1, 2] \\ \text{[illegible]} \end{cases} \text{with probability [illegible]},
\qquad
\text{Row 2 in } \begin{cases} (5, 7] & \text{with probability } 0.7 \\ (1, 3] & \text{with probability } 0.1 \\ (0, 1] & \text{with probability } 0.2 \end{cases}
\tag{1}
$$

and

$$
\text{Row 1 in } \begin{cases} [2, 4] & \text{with probability } 0.5 \\ (5, 9] & \text{with probability } 0.5 \end{cases}
\tag{2}
$$

The contents of the code, name and numeric fields for these distributions are
shown in Table 9.2. The user specifies which is the desired definition (our
definition names “DIST1” and “DIST2” were chosen arbitrarily) when the
appropriate input utility is called. Note that every value contains a decimal point.


Table 9.2 Contents of a sample PIECEWISE DISTRIBUTIONS section.

[Table body lost in extraction. Surviving column headers: Second Name Field, First Numeric Field.]


SCENARIOS (the value of p is defined by a sample of vectors) - The p vector
may take one of a finite number of values. The data for this form consists of a
number of “definitions”, which are analogous to the “vectors” in the RANGES
and BOUNDS sections of the core file (see [1]). Each definition provides a
sample of vectors and consists of sets of entries giving a value for p and the
probability that p takes that value. The first line in each entry contains the
letters “SC” in the code field (columns 2 and 3), the name of the definition
to which the entry belongs in the first name field (columns 5 through 12), a
name identifying the scenario in the second name field (columns 15 through
22), and the probability that p takes the value associated with this scenario in
the first numeric field (columns 25 through 36). Subsequent lines specify the
values that the rows of p assume under the scenario. There must be one of
these lines for each row named in the TECHNOLOGY section. The code field
of these lines contains the letters “RV”, the first name field contains the name
of the definition to which the entry belongs, the second name field contains
the name of the row whose value the line specifies, and the first numeric field
contains the value.


The sum of the probabilities for the scenarios in a given definition must be 
unity. Entries for different scenarios or different definitions must not be mixed 
together in the input file. 
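
The SC and RV lines described above can be produced by a short routine. This is a sketch under our own assumptions: the helper names `data_line` and `scenario_lines` are hypothetical, numbers are written with four decimals (the format only requires that a decimal point be present), and each scenario carries its row values in a dictionary keyed by row name.

```python
def data_line(code, name1, name2, number):
    """Lay out code (cols 2-3), two names (cols 5-12 and 15-22)
    and one number (cols 25-36, right-justified)."""
    text = f"{number:.4f}"  # always contains a decimal point
    return f" {code:<2} {name1:<8}  {name2:<8}  {text:>12}"


def scenario_lines(defname, scenarios, tol=1e-9):
    """Build the data lines of one SCENARIOS definition.

    scenarios: list of (scenario_name, probability, {rowname: value}).
    """
    total = sum(prob for _, prob, _ in scenarios)
    if abs(total - 1.0) > tol:
        raise ValueError(f"scenario probabilities sum to {total}")
    lines = []
    for name, prob, rows in scenarios:
        # One SC line per scenario, followed by one RV line per row.
        lines.append(data_line("SC", defname, name, prob))
        for rowname, value in rows.items():
            lines.append(data_line("RV", defname, rowname, value))
    return lines
```

Calling `scenario_lines("SAMP1", [("SCEN1", 0.5, {"TROW1": 1.0, "TROW2": 2.0}), ...])` yields one SC line followed by two RV lines for each scenario, kept together as the rule above requires.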


As an example, let the T matrix have two rows, TROW1 and TROW2, and 
define two distributions of the vector p as follows: 


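$$
\text{Vector} = \begin{cases} [\,1 \;\; 2\,] & \text{with probability } 0.5 \\ [\,3 \;\; 4\,] & \text{with probability } 0.3 \\ [\,5 \;\; 6\,] & \text{with probability } 0.2 \end{cases}
\tag{1}
$$

and

$$
\text{Vector} = \text{[illegible]} \ \text{with probability [illegible]}. \tag{2}
$$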


The contents of the code, name and numeric fields for these distributions are 
shown in Table 9.3. The user specifies which is the desired definition (our defini- 
tion names SAMP1 and SAMP2 were chosen arbitrarily) when the appropriate 
input utility is called. The scenario names SCEN1, SCEN2, and SCEN3 were
chosen arbitrarily. Note that every value contains a decimal point.


Table 9.3 Contents of a sample SCENARIOS DISTRIBUTIONS section.

[Table body lost in extraction. Surviving column headers: Code Field, First Name Field; one surviving code entry: SC.]


NONE (no data) - There is no data. The user must decide where and how to 
obtain the necessary values. 


(4) RECOURSE - This section specifies the contents of W. The section may
take one of the forms whose names follow: 

SIMPLE (simple recourse) - There are no data lines in this case. The recourse
matrix is assumed to be [I, -I], where I has rank equal to the number of rows
in the technology matrix. 

DETERMINISTIC (the elements of W follow) - The recourse matrix is given
by the data following the section header. The format of the data is identical 
to that of the COLUMNS section of the core file, i.e., the contents of the 
matrix are specified in column order. The first name field on a line (columns 
5 through 12) contains the name of the column. The remaining name/numeric 
field pairs (columns 15 through 22/25 through 36 and 40 through 47/50 through 
61) specify a row name and the contents of the matrix at the position given by 
the row and column names. The row names form a subset of the row names in 
the TECHNOLOGY section. 

CORE (the elements of W appear in the core file) - The data consists of a list 
of names which form a subset of the column names specified in the COLUMNS 
section of the core file. The contents of those columns (as specified in the 
COLUMNS section of the core file) constitute the recourse matrix. One name 
appears per line, in the first name field (columns 5 through 12). 

STOCHASTIC (the elements of W are supplied by a subroutine) - The data
consists of a list of the names of the rows of the recourse matrix. Associated
with each name is one or more column names. These column names specify the
active columns within the given row and form a subset of the column names
specified in the COLUMNS section of the core file. The values for the recourse 
matrix do not appear in either data file but are supplied by a subroutine written 
by the user. The row names appear in the first name field of a line (columns 
5 through 12) and the other two name fields (columns 15 through 22 and 40 
through 47) are available for the column names. 


NONE (no data) - There is no data. The user must decide where and how to 
obtain the necessary values. 


(5) OBJECTIVES - This section specifies the form of q. The section may
take one of the forms whose names follow:


LINEAR (q is a linear function) - The recourse objective is given by q(y) =
qy, where q is given by the data following the section header. The data for
this form consists of a number of “definitions”, which are analogous to the
“vectors” in the RANGES and BOUNDS sections of the core file (see [1]).
Each definition specifies the elements of q and consists of entries of the form
“defname name value”, where “defname” is the name of the definition to which
the entry belongs, “name” is the name of a column of W (or of a row of T;
see below) and “value” is the value for the corresponding row of q. “defname”
occupies the first name field on a line (columns 5 through 12), “name” occupies
the second name field (columns 15 through 22) and “value” occupies the first
numeric field (columns 25 through 36).


Entries for different definitions must not be mixed together in the input file. 


As an example, let the W matrix have two columns, WCOL1 and WCOL2, and 
define two vectors q as follows: 


$$q = [\,7 \;\; 9\,] \tag{1}$$

and

$$q = [\,3 \;\; 3\,] \tag{2}$$

The contents of the name and numeric fields for these vectors are shown in 
Table 9.4. The user specifies which is the desired definition (our definition 
names “VEC1” and “VEC2” were chosen arbitrarily) when the appropriate 
input utility is called. Note that every value contains a decimal point. 


Table 9.4 Contents of a sample LINEAR OBJECTIVES section.


First     Second    First
Name      Name      Numeric
Field     Field     Field

VEC1      WCOL1     7.0
VEC1      WCOL2     9.0
VEC2      WCOL1     3.0
VEC2      WCOL2     3.0


PIECEWISE (q is two-piece linear) - The recourse objective is assumed to be
two-piece continuous about zero, i.e.

$$
q(y) = \sum_i q_i(y_i) \quad \text{with} \quad q_i(y_i) = \begin{cases} q_i^{+} y_i, & y_i \ge 0, \\ -q_i^{-} y_i, & y_i \le 0. \end{cases}
$$

The data for this form consists of a number of “definitions”, which are analogous
to the “vectors” in the RANGES and BOUNDS sections of the core file (see
[1]). Each definition specifies the values of $q_i^{+}$ and $q_i^{-}$ for all i and consists of
entries of the form “defname name value value”, where “defname” is the name
of the definition to which the entry belongs, “name” is the name of a column
of W (or of a row of T; see below), the first value gives the corresponding value
of $q^{+}$, and the second value gives the corresponding value of $q^{-}$. The names
occupy the first and second name fields on a line (columns 5 through 12 and 15
through 22) and the values occupy the first and second numeric fields (columns
25 through 36 and 50 through 61).


Entries for different definitions must not be mixed together in the input file. 


As an example, let the W matrix have two columns, WCOL1 and WCOL2, and
define two vectors q as follows:

$$
q = \left( \begin{cases} -2, & y_1 \le 0 \\ 5, & y_1 \ge 0 \end{cases}, \quad \begin{cases} -3, & y_2 \le 0 \\ 7, & y_2 \ge 0 \end{cases} \right) \tag{1}
$$

and

$$
q = \left( \begin{cases} -5, & y_1 \le 0 \\ 3, & y_1 \ge 0 \end{cases}, \quad \begin{cases} -9, & y_2 \le 0 \\ 2, & y_2 \ge 0 \end{cases} \right) \tag{2}
$$

The contents of the name and numeric fields for these vectors are shown in
Table 9.5. The user specifies which is the desired definition (our definition
names VEC1 and VEC2 were chosen arbitrarily) when the appropriate input
utility is called. Note that every value contains a decimal point and that the
values of $q^{-}$ are positive.


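Table 9.5 Contents of a sample OBJECTIVES (PIECEWISE) section.

[Table body largely lost in extraction. Column headers: First Name Field, Second Name Field, First Numeric Field, Second Numeric Field. One surviving pair of numeric entries: 2.0 and 5.0.]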





NONE (no data) - There is no data. The user must decide where and how to 
obtain the necessary values. 


Note - if the recourse matrix is simple (i.e., if there are no column names for W), 
row names of T are substituted for column names of W in the OBJECTIVES 
section. 

(6) ENDATA - This line marks the end of the stochastics file (the section
contains no data lines). 

It is clear that we have covered only a few of the possibilities for most of the 
above sections. However, the format is such that new forms can be added as 
the need arises. 


CHAPTER 10


A COMPUTER CODE FOR SOLUTION OF 
PROBABILISTIC-CONSTRAINED STOCHASTIC 
PROGRAMMING PROBLEMS 


T. Szantai 


10.1 Introduction 


The theory of logarithmic concave measures was developed by A. Prékopa [1, 3].
This theory made it possible to handle joint probabilistic constraints
in stochastic programming problems. These constraints are of the form

$$P(d_i x \ge \beta_i,\ i = 1, \dots, s) \ge p, \tag{10.1}$$

where the random variables $\beta_1, \dots, \beta_s$ have a logconcave joint distribution. To
calculate the probability value (10.1) one can apply multi-dimensional
integration techniques. Unfortunately these methods converge extremely slowly
in higher dimensions. In these cases only Monte Carlo methods are
applicable, and this is the reason why probabilistic-constrained stochastic
programming problems of this type cannot be solved efficiently by standard
nonlinear programming codes. In the last ten years many test problems have
been solved and many real applications have been worked out. All of this
work required the development of an individual computer code suited to the special
problem to be solved.

In this paper we give a short description of a computer code intended
to solve a relatively wide class of probabilistic-constrained stochastic programming
problems. In the last section we also give the results for some simple test
problems. 

The computer code is contained in the collection of experimental computer 
codes assembled by the Adaptation and Optimization (ADO) project of the 
Systems and Decision Sciences (SDS) program at the International Institute 
for Applied Systems Analysis (IIASA). This collection is available on computer 
tape to researchers. The tape contains a User’s Manual for each program as 
well. 


10.2 The Solution Method


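We solve probabilistic-constrained stochastic programming problems of the
form

$$
\begin{aligned}
\text{minimize} \quad & c_1 x_1 + \dots + c_n x_n \\
\text{subject to} \quad & Ax = b \\
& x \ge 0 \\
\text{and} \quad & P(Dx \ge \beta) \ge p,
\end{aligned}
\tag{10.2}
$$

where A is a known m × n matrix, D is a known s × n matrix, b is known and of
the appropriate dimension, p is the prescribed probability level, and $\beta_1, \dots, \beta_s$
have joint normal probability distribution with expected values

$$E(\beta_1) = \mu_1, \ \dots, \ E(\beta_s) = \mu_s,$$

with variances

$$D^2(\beta_1) = \sigma_1^2, \ \dots, \ D^2(\beta_s) = \sigma_s^2$$

and with the correlation matrix

$$
R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1s} \\
r_{21} & 1 & \cdots & r_{2s} \\
\vdots & \vdots & \ddots & \vdots \\
r_{s1} & r_{s2} & \cdots & 1
\end{pmatrix}.
$$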

In problem (10.2) the linear constraints may include inequalities as well and 
explicit upper bounds on the variables can be specified. 
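
To fix ideas, the probability in the last constraint of (10.2) can be estimated for a given x by plain sampling. The sketch below is a crude Monte Carlo estimator written for illustration only; it is not the variance-reduced procedure used by the code described in this chapter. The function names are our own, and `sigma` holds standard deviations (the square roots of the variances).

```python
import math
import random


def cholesky(matrix):
    """Lower triangular factor L with L L^T = matrix (positive definite)."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def joint_probability(d, x, mu, sigma, corr, samples=20000, seed=0):
    """Estimate P(Dx >= beta) for jointly normal beta by plain sampling."""
    rng = random.Random(seed)
    low = cholesky(corr)
    s = len(mu)
    lhs = [sum(dij * xj for dij, xj in zip(row, x)) for row in d]
    hits = 0
    for _ in range(samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(s)]
        # Correlate the standard normals, then shift and scale each row.
        beta = [mu[i] + sigma[i] * sum(low[i][k] * z[k] for k in range(i + 1))
                for i in range(s)]
        if all(lhs[i] >= beta[i] for i in range(s)):
            hits += 1
    return hits / samples
```

With three independent rows and Dx equal to the vector of expected values, the estimate should be close to 0.5³ = 0.125; its standard error with 20000 samples is about 0.002, which indicates why plain sampling becomes expensive when the probability must be evaluated many times.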

For the solution of problem (10.2) we apply Veinott’s supporting hyperplane
algorithm. This algorithm solves general nonlinear programming problems
and is especially practical when the problem has just one nonlinear
constraint in addition to a possibly large number of linear constraints. A complete
description of the algorithm is given in Veinott [4]. Here we give only details 
which are related to the stochastic feature of the problem. 

To obtain a starting point in the interior of the feasible domain one can 
solve the linear programming problem 


$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{s} (d_{i1} x_1 + \dots + d_{in} x_n - \mu_i) / \sigma_i \\
\text{subject to} \quad & Ax = b \\
& Dx \ge \mu + t\sigma \\
& x \ge 0
\end{aligned}
\tag{10.3}
$$

where $d_{ij}$ is the element of D in the i-th row and j-th column and t is a
constant. The value of the parameter t should be chosen based on the desired
probability level, p. For high probabilities the value 3 is recommended. If the
optimal solution of the linear programming problem (10.3) turns out not to be
an interior point of the feasible domain, i.e. it does not satisfy the probabilistic
constraint, one should try to solve problem (10.3) with a larger value of the
parameter t. Of course, when a relatively large parameter value t is chosen it
may be that the linear programming problem (10.3) will not have any feasible
solution. Experience shows that the selection of an appropriate parameter value
is not difficult.

To obtain a starting point outside the feasible domain, the program solves 
the linear programming problem 


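$$
\begin{aligned}
\text{minimize} \quad & c_1 x_1 + \dots + c_n x_n \\
\text{subject to} \quad & Ax = b \\
& x \ge 0.
\end{aligned}
\tag{10.4}
$$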


In the case of an unbounded objective, one must provide additional constraints 
on the variables which do not disturb the probabilistic constraint. 

To find the boundary point of the probabilistic constraint at each iteration
we use an interval bisection algorithm with a sophisticated stopping rule. Let
$x_{in}$ denote the current point in the interior of the feasible domain and $x_{out}$ the
point outside the feasible domain. We want to determine the value $\lambda$ for which

$$x_\lambda = x_{out} + \lambda (x_{in} - x_{out}), \qquad 0 < \lambda < 1,$$

and

$$P(d_i x_\lambda \ge \beta_i,\ i = 1, \dots, s) = p.$$


In an earlier paper (see Szantai [8]) we published a method for constructing 
good lower and upper bounds on the probability values of type (10.1). This 
method is based on the so called Bonferroni inequalities. First of all one can 
reduce the size of the uncertainty interval by means of these bounds. Let us
denote by

$$P_{lower}(d_i x \ge \beta_i,\ i = 1, \dots, s)$$

and

$$P_{upper}(d_i x \ge \beta_i,\ i = 1, \dots, s)$$

the lower and upper bounds of the probability value (10.1). Then we can first
find the values $\lambda_{lower}$ and $\lambda_{upper}$ for which

$$P_{lower}(d_i x_{\lambda_{lower}} \ge \beta_i,\ i = 1, \dots, s) = p$$

and

$$P_{upper}(d_i x_{\lambda_{upper}} \ge \beta_i,\ i = 1, \dots, s) = p.$$

It is clear that we may restrict the search to the interval $(\lambda_{lower}, \lambda_{upper})$
instead of the interval (0, 1).

We calculate the probability values by Monte Carlo simulation. Although
we apply a variance reduction technique (see Szantai [3]), the calculation of the
probability value (10.1) involves some errors. So we should take special care
to stop outside of the feasible domain rather than inside. For this purpose we
apply a modified stopping rule in the interval bisection algorithm which is as
follows:

1. If $P(d_i x_{\lambda_{half}} \ge \beta_i,\ i = 1, \dots, s) > p + \varepsilon$ then let $\lambda_{upper} = \lambda_{half}$ and
repeat the bisection.

2. If $P(d_i x_{\lambda_{half}} \ge \beta_i,\ i = 1, \dots, s) < p - 2\varepsilon$ then let $\lambda_{lower} = \lambda_{half}$ and
repeat the bisection.

3. If $p - 2\varepsilon \le P(d_i x_{\lambda_{half}} \ge \beta_i,\ i = 1, \dots, s) \le p - \varepsilon$ then stop; $x_{\lambda_{half}}$ is
a boundary point with the prescribed tolerance.

4. If $p - \varepsilon < P(d_i x_{\lambda_{half}} \ge \beta_i,\ i = 1, \dots, s) \le p + \varepsilon$ then make a new,
more accurate evaluation of the probability value (i.e. use more random
numbers in the Monte Carlo simulation). Now
(a) If $P_{new}(d_i x_{\lambda_{half}} \ge \beta_i,\ i = 1, \dots, s) > p$ then let $\lambda_{upper} = \lambda_{half}$
and repeat the bisection.
(b) If $P_{new}(d_i x_{\lambda_{half}} \ge \beta_i,\ i = 1, \dots, s) \le p$ then stop; $x_{\lambda_{half}}$ is a
boundary point with the prescribed tolerance.

Here $\varepsilon$ is the prescribed tolerance, $x_{\lambda_{half}}$ the midpoint of the current search interval
and $P_{new}$ the more accurate probability value. The four cases are illustrated
in Figure 10.1.


             Case 1.
p + ε   ------------
p            Case 4.
p − ε   ------------
             Case 3.
p − 2ε  ------------
             Case 2.


Figure 10.1 The stopping rule illustrated. 
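
The four cases of the stopping rule translate directly into a loop. The following is a sketch of the rule as stated above, not an excerpt from the code described in this chapter: `prob` and `prob_accurate` are hypothetical callables returning the cheap and the more accurate estimate of the probability at $x_\lambda$, and the iteration limit `max_iter` is our own safeguard.

```python
def find_boundary(prob, prob_accurate, p, eps,
                  lam_lower=0.0, lam_upper=1.0, max_iter=100):
    """Bisect on lambda in x_lambda = x_out + lambda * (x_in - x_out).

    Returns a lambda whose estimated probability lies at or below p,
    so that the accepted point errs towards the outside of the
    feasible domain.
    """
    for _ in range(max_iter):
        lam_half = 0.5 * (lam_lower + lam_upper)
        value = prob(lam_half)
        if value > p + eps:                      # case 1: clearly inside
            lam_upper = lam_half
        elif value < p - 2.0 * eps:              # case 2: too far outside
            lam_lower = lam_half
        elif value <= p - eps:                   # case 3: accept
            return lam_half
        elif prob_accurate(lam_half) > p:        # case 4(a): re-evaluated, inside
            lam_upper = lam_half
        else:                                    # case 4(b): accept
            return lam_half
    raise RuntimeError("no boundary point found within max_iter bisections")
```

For a noise-free test with `prob(lam) = lam`, p = 0.9 and ε = 0.01, the loop visits 0.5, 0.75, 0.875, 0.9375 and 0.90625 and stops at 0.890625 through case 4(b).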


For constructing the supporting hyperplane it is necessary to calculate the
gradient vector of the probability (10.1) as a function of the variables x at the
current boundary point. The partial derivatives of the probability (10.1) can be
expressed by means of conditional probabilities. Since the conditional
distributions of a normal distribution are normal too, we can apply the
same Monte Carlo simulation for the gradient vector calculation as before.

The supporting hyperplane algorithm stops when for the actual point out- 
side the feasible domain 


$$P(d_i x_{out} \ge \beta_i,\ i = 1, \dots, s) \ge p - \varepsilon.$$




In this case we accept the last boundary point as the optimal solution of sto- 
chastic programming problem (10.2). 


10.3 A Test Problem


Let us consider a coffee company marketing three different blends of coffee, No.
1, 2 and 3. The coffee company has developed a rigid set of requirements for
each of its 3 blends:


                     No. 1    No. 2    No. 3

acidity              ≤ 3.5    ≤ 4.0    ≤ 5.0
caffeine             ≤ 2.8    ≤ 2.2    ≤ 2.4
liquoring value      ≥ 7.0    ≥ 6.0    ≥ 5.0
hardness             ≤ 2.5    ≤ 3.0    ≤ 7.8
aroma                ≥ 7.0    ≥ 5.0    ≥ 4.0


Forecasts indicate that the demands for the company’s three blends during 
the coming month will be as follows: 


blend No. 1. 3,000 pounds 
blend No. 2. 40,000 pounds 
blend No. 3. 20,000 pounds 


On the first day of a particular month the company found that its available 
supply of green coffees was limited to eight different types as indicated in the 
following table. According to this table, these coffees vary according to (1) 
price, (2) quantity available, and (3) taste characteristics. 





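[Table of the eight available green coffee types, largely lost in extraction. Surviving column headers: percent caffeine content, liquoring value, hardness, aroma. One column of values survives, but its header cannot be identified: 6, 5, 8, 6, 6, 6, 6, 5.]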








The company is confronted with the problem of determining an optimum 
combination of available green coffees for next month’s roasting operation. We 
may regard the demands for the company’s 3 blends during the coming month 
as normally distributed random variables with expected values equal to the 
forecasts listed above. Then the company should determine an optimum com- 
bination of available green coffees so that the random demands will be met with 
a prescribed probability. Let $x_{ij}$ be the amount of the i-th type of green coffee in
blend j. Then after some scaling we get the stochastic programming problem:

minimize

$$\sum_{j=1}^{3} \left( 350 x_{1j} + 200 x_{2j} + 440 x_{3j} + 410 x_{4j} + 360 x_{5j} + 340 x_{6j} + 360 x_{7j} + 190 x_{8j} \right)$$

subject to the supply constraints $x_{i1} + x_{i2} + x_{i3} \le$ (quantity available of type i),
i = 1, ..., 8; fifteen blend requirement constraints, one for each of the five taste
characteristics of each of the three blends, each linear in the $x_{ij}$ of that blend
with right hand side 0; and the joint probabilistic demand constraint

$$P\left( \sum_{i=1}^{8} x_{i1} \ge \beta_1,\ \sum_{i=1}^{8} x_{i2} \ge \beta_2,\ \sum_{i=1}^{8} x_{i3} \ge \beta_3 \right) \ge p,$$

[The coefficient array of the supply and blend requirement constraints is garbled beyond recovery in the source.]



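where the random variables $\beta_1, \beta_2, \beta_3$ are normally distributed with expected
values

$$E(\beta_1) = 3, \quad E(\beta_2) = 40, \quad E(\beta_3) = 20,$$

with variances

$$D^2(\beta_1) = 0.25, \quad D^2(\beta_2) = 25, \quad D^2(\beta_3) = 9$$

and with three different correlation matrices (in three different groups of the
test problems):

$$
R_1 = \begin{pmatrix} 1 & 0.1 & 0.1 \\ 0.1 & 1 & 0.9 \\ 0.1 & 0.9 & 1 \end{pmatrix}, \quad
R_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
R_3 = \begin{pmatrix} 1 & 0.1 & 0.1 \\ 0.1 & 1 & -0.9 \\ 0.1 & -0.9 & 1 \end{pmatrix}.
$$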
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Some results concerning the test problems are: 
probability optimal 


level value 
1. Posttive correlations (R,) 
deterministic problem 0.228 18500.0 
stochastic problem No. 1 0.9 22564.0 
stochastic problem No. 2 0.95 23603.6 
stochastic problem No. 3 0.99 25500.6 
2. Independent case (Ra) 
deterministic problem 0.125 18500.0 
stochastic problem No. 1 0.9 22949.4 
stochastic problem No. 2 0.95 23866.6 
stochastic problem No. 3 0.99 25639.8 
3. Negative correlations (R3) 
deterministic problem 0.051 18500.0 
stochastic problem No. 1 0.9 22961.6 
stochastic problem No. 2 0.95 23885.2 
stochastic problem No. 3 0.99 25680.6 


In the above list the deterministic problem always means the linear programming
problem with the forecasted demands. Its optimal solution has different
probability levels according to the correlation matrices.


CHAPTER 11


CONDITIONAL PROBABILITY AND CONDITIONAL 
EXPECTATION OF A RANDOM VECTOR 


H. Gassmann 


Abstract 


Some problems in stochastic programming require the computation of condi- 
tional information on a multivariate random variable over an n-dimensional 
rectangle. For continuous distributions this involves a multidimensional inte- 
gration and is thus a very hard problem. This paper describes various approxi- 
mation methods in the case of the multivariate normal distribution along with 
numerical evidence of their performance. Extensions to more general sets
and to other distributions, such as the multigamma, are discussed as well.


11.1 Introduction 
Stochastic programming problems of the form

\[ \min \; E_\xi\, \varphi(x, \xi), \]

where $\xi$ is a random vector on some probability space $(\Omega, \mathcal{S}, P)$, have been
used extensively in the literature (see e.g. [2],[23],[24] and the references cited
therein).

In principle, the above is a nonlinear programming problem and could be
solved by ordinary NLP techniques. The reason why this is not done is that
the evaluation of the objective function (and a fortiori of derivatives if they
exist) is often extremely costly, since taking the expectation with respect to
$\xi$ amounts to a multidimensional integration or sometimes a finite sum with a
large number of terms.

Frequently $\varphi(x, \xi)$ is convex, so that error bounds based on Jensen's
inequality and on the Edmundson-Madansky inequality [10],[15] are available,
and it is these bounds one works with rather than the function itself. Estimates
are usually of the form

\[ \sum_{i=1}^{I} p_i\, \varphi(x, \bar\xi^i) \;\le\; E_\xi\, \varphi(x, \xi) \;\le\; \sum_{i=1}^{I} p_i\, u^i, \tag{11.1} \]

where $p_i$ and $\bar\xi^i$ are the conditional probability and conditional mean of $\xi$ given
that $\xi \in A_i$. The set $\{A_i : i = 1,\ldots,I\}$ forms a partition of the sample space
into polyhedral sets (bounded or unbounded), while $u^i$ is some upper bound on
the conditional expectation of $\varphi(x, \xi)$ given $\xi \in A_i$.

If $\xi$ is a continuous random vector with distribution $F$, then $p_i = \int_{A_i} dF$ and
$\bar\xi^i = (\bar\xi^i_1, \ldots, \bar\xi^i_n) = \frac{1}{p_i}(q^i_1, \ldots, q^i_n)$, where $q^i_k = \int_{A_i} \xi_k\, dF(\xi)$. Evaluating these
multidimensional integrals is a nontrivial problem in its own right, and it is
this integration problem that the paper is concerned with. The main emphasis
is on the multivariate normal distribution, and several probabilistic methods
are developed in Sections 11.2-11.5 to compute $p_i$ and $q^i_k$ under the assumption
that the partition $\{A_i\}$ consists of n-dimensional rectangles of the form

\[ A_i = \prod_{k=1}^{n} [a^i_k, b^i_k]. \]


Section 11.2 describes some trivial cases and estimation by a simple Monte-Carlo
method. Deák's decomposition is presented in Section 11.3, Szántai's
Bonferroni-type approach appears in Section 11.4. The two techniques are
combined in Section 11.5 into a hybrid method which attempts to exploit the
advantages of both. Numerical results are given in Section 11.6 to contrast the
performance of these methods on a small number of sample problems. Section 11.7
discusses briefly some of the problems encountered when forming the
quotient $\bar\xi^i_k = q^i_k / p_i$. In Section 11.8 we describe extensions of the various
techniques to general polyhedral sets and a modification of Szántai's method
to treat other multivariate distributions.


11.2 Multivariate Normal Distribution; Simple Monte-Carlo 


Arguably the most commonly used continuous multivariate distribution is the
multivariate normal distribution [13] whose density $f$ is given by

\[ f(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right), \tag{11.2} \]

where $n$ is the dimension of the random vector $x$, $\mu$ its mean and $\Sigma$ its
covariance matrix, assumed symmetric and positive definite. The multivariate
normal distribution possesses some attractive properties which will be used in
the description of some of the methods in this paper. It is well known, for
instance, that if $x \sim N(\mu, \Sigma)$, then $y = Cx \sim N(C\mu, C\Sigma C')$ for an arbitrary
matrix $C$, the only proviso being that the product $Cx$ be well defined. To
simplify some of the presentation we shall assume given an n-dimensional rectangle
$A = \prod_{i=1}^{n} [a_i, b_i]$ and a random vector $x \sim N(0, \Sigma)$ where $\Sigma$ is a correlation
matrix, i.e. $\operatorname{diag} \Sigma = (1, 1, \ldots, 1)$. This does not constitute a loss of generality
since $x$ can always be standardized by the linear transformation $x \to y$ defined
by $y_i = (x_i - \mu_i) / \sqrt{\sigma_{ii}}$.


We shall denote

\[ p = \int_A f(x)\, dx, \qquad q_k = \int_A x_k f(x)\, dx, \tag{11.3} \]

where $f$ is the normal density as in (11.2). Before discussing the problem in
full generality, we shall give some special cases for which the solution is easy.


1. If $n = 1$, there are no problems. The conditional probability
   $p = \frac{1}{\sqrt{2\pi}} \int_{a_1}^{b_1} e^{-x^2/2}\, dx$ can be found for example by expanding the integrand into
   a power series. There are efficient and reliable routines in almost every
   mathematical software package which will perform the computation. The
   numerator $q_1$ is even easier to obtain since the corresponding integral can
   be solved analytically. Thus

   \[ q_1 = \frac{1}{\sqrt{2\pi}} \int_{a_1}^{b_1} x\, e^{-x^2/2}\, dx = \frac{1}{\sqrt{2\pi}} \left[ e^{-a_1^2/2} - e^{-b_1^2/2} \right]. \]


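2. For $n = 2$, the answer is obtained almost as easily. The integral for $p$
   can be developed into a power series [6], similar to the one-dimensional
   case, and commercial software exists which performs the evaluation. In
   order to calculate $q_1$, and similarly $q_2$, it is possible to exploit the fact that
   $h(t) = t\, e^{-t^2/2}$ permits analytical integration. Thus one may complete the
   square in the exponent, exchange the order of integration, and simplify.
   This gives (details are in [9])

   \[
   \begin{aligned}
   q_1 = \frac{1}{\sqrt{2\pi}} \Bigg(
     & e^{-a_1^2/2} \left[ \Phi\!\left( \frac{b_2 - \sigma_{12} a_1}{\sqrt{1 - \sigma_{12}^2}} \right) - \Phi\!\left( \frac{a_2 - \sigma_{12} a_1}{\sqrt{1 - \sigma_{12}^2}} \right) \right] \\
   - {} & e^{-b_1^2/2} \left[ \Phi\!\left( \frac{b_2 - \sigma_{12} b_1}{\sqrt{1 - \sigma_{12}^2}} \right) - \Phi\!\left( \frac{a_2 - \sigma_{12} b_1}{\sqrt{1 - \sigma_{12}^2}} \right) \right] \\
   + {} & \sigma_{12}\, e^{-a_2^2/2} \left[ \Phi\!\left( \frac{b_1 - \sigma_{12} a_2}{\sqrt{1 - \sigma_{12}^2}} \right) - \Phi\!\left( \frac{a_1 - \sigma_{12} a_2}{\sqrt{1 - \sigma_{12}^2}} \right) \right] \\
   - {} & \sigma_{12}\, e^{-b_2^2/2} \left[ \Phi\!\left( \frac{b_1 - \sigma_{12} b_2}{\sqrt{1 - \sigma_{12}^2}} \right) - \Phi\!\left( \frac{a_1 - \sigma_{12} b_2}{\sqrt{1 - \sigma_{12}^2}} \right) \right] \Bigg),
   \end{aligned}
   \]

   where $\Phi(\cdot)$ is the standard normal distribution function in one dimension.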


3. The last trivial case occurs when $x$ has independent components. In this
   situation, $\Sigma = I$, and the problem at hand can be reduced to separate
   applications of the one-dimensional computation as follows:

   \[ p = \frac{1}{(2\pi)^{n/2}} \int_A e^{-\sum_i x_i^2/2}\, dx = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \int_{a_i}^{b_i} e^{-x_i^2/2}\, dx_i. \]

Difficulties arise for $n \ge 3$ if the components of $x$ are correlated. While
series expansions do exist [7,8], they converge slowly, and it is usually best
to estimate the integrals by resorting to some sampling technique. The simple
Monte-Carlo method [12] consists in generating an independent sample
$\{x^1, \ldots, x^N\}$ of size $N$ from the distribution of $x$, counting all instances for
which the sample point lies in the rectangle $A$, ignoring the others, and forming
the estimator

\[ \hat p = \frac{1}{N} \sum_{j=1}^{N} 1_A(x^j), \]

where $1_A$ denotes the indicator function of the rectangle $A$. It can be shown
that $\hat p$ is unbiased, that is, the expected value of (the random variable) $\hat p$ is equal
to $p$.
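
The closed forms of the special cases above ($n = 1$ and independent components) give exact reference values against which any sampling estimate can be checked. The following is a minimal sketch in Python under the standardization assumed in the text (zero mean, unit variances); the function names `std_normal_cdf`, `interval_p_q` and `independent_p_q` are illustrative and do not come from the original paper.

```python
import math


def std_normal_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def interval_p_q(a, b):
    """Return (p, q) for one standard normal component restricted to [a, b]."""
    p = std_normal_cdf(b) - std_normal_cdf(a)
    # The integral of x * exp(-x^2 / 2) over [a, b] is available in closed form.
    q = (math.exp(-0.5 * a * a) - math.exp(-0.5 * b * b)) / math.sqrt(2.0 * math.pi)
    return p, q


def independent_p_q(lower, upper):
    """Return p and the list q_1..q_n when the components are independent."""
    parts = [interval_p_q(a, b) for a, b in zip(lower, upper)]
    p = math.prod(p_i for p_i, _ in parts)
    q = []
    for k, (_, q_k) in enumerate(parts):
        # q_k integrates x_k over its own interval; the other coordinates
        # contribute only their interval probabilities.
        others = math.prod(p_i for i, (p_i, _) in enumerate(parts) if i != k)
        q.append(q_k * others)
    return p, q
```

Infinite bounds can be passed as `-math.inf` and `math.inf`; the conditional mean of component $k$ is then `q[k] / p`.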

Similarly one has the estimators $\hat q_k = \frac{1}{N} \sum_{j=1}^{N} x^j_k\, 1_A(x^j)$, which can be
shown to be unbiased for $q_k$. Unfortunately, since the estimators are random
variables, any performance guarantee can only be formulated in probability,
and an individual estimate may be far from the true value, even if its variance
is small. Moreover, the variance of $\hat p$, and similarly of $\hat q_k$, is proportional to
the inverse of the sample size, which necessitates a rather large sample if any
meaningful accuracy requirement has to be satisfied.

For this reason much effort has been invested in finding variance reduction
schemes. The most popular device is based on "antithetic variables", that is,
whenever $x^j$ is a point generated in the sample, one also includes the point $-x^j$.
Other, more powerful methods will be described in the following sections.
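
The simple Monte-Carlo estimators, with and without antithetic variables, can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the numerical results: the sample points are built as $x = Lw$ from independent standard normal deviates $w$ with a Choleski factor $L$ of $\Sigma$, and the names `cholesky` and `simple_monte_carlo` as well as the `seed` argument are hypothetical.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular L with L L' = sigma (sigma symmetric positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def simple_monte_carlo(sigma, lower, upper, n_samples, antithetic=False, seed=0):
    """Estimate p = P(x in A) and q_k = E[x_k 1_A(x)] for x ~ N(0, sigma)."""
    rng = random.Random(seed)
    n = len(sigma)
    L = cholesky(sigma)
    hits = 0
    q_sum = [0.0] * n
    count = 0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # The triangular structure of L limits the inner sum to k <= i.
        x = [sum(L[i][k] * w[k] for k in range(i + 1)) for i in range(n)]
        points = [x, [-c for c in x]] if antithetic else [x]
        for pt in points:
            count += 1
            if all(lo <= c <= up for lo, c, up in zip(lower, pt, upper)):
                hits += 1
                for k in range(n):
                    q_sum[k] += pt[k]
    return hits / count, [s / count for s in q_sum]
```

With `antithetic=True` every generated point is paired with its mirror image, so twice as many points are evaluated for the same number of normal deviates.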

A comment should be made here on how to construct the sample points.
Since $y = Cx \sim N(C\mu, C\Sigma C')$ whenever $x \sim N(\mu, \Sigma)$, it suffices to generate
$n$ independent univariate standard normal deviates $w^j_i$, $i = 1, \ldots, n$, and to
calculate $x^j = L w^j$ for some matrix $L$ such that $LL' = \Sigma$. An attractive choice
for $L$ in this setting is the Choleski decomposition [20] of $\Sigma$, because it can
be computed efficiently and because its triangular structure may reduce the
computational effort necessary in calculating $x^j$.


11.3 Deák's Method


Deák [3,4] describes a more efficient method for finding the conditional
probability $p$, based on the decomposition

\[ x = \lambda L v, \tag{11.4} \]

where $\lambda$ is chi-distributed with $n$ degrees of freedom, and $v$ is uniformly
distributed on $S^n$, the unit sphere in $R^n$. Here $\lambda$ can be interpreted as the length
of the vector $x$, $v$ as its direction. $L$ is taken as the Choleski decomposition, as
in the simple Monte-Carlo method. Then

\[ p = \int_A f(x)\, dx = \int_{S^n} \int_{r_1(v)}^{r_2(v)} d\chi_n(\lambda)\, dU(v), \]

where $r_1(v) = \min\{r : r \ge 0,\ a \le r L v \le b\}$,
      $r_2(v) = \max\{r : r \ge 0,\ a \le r L v \le b\}$.


For an illustration of this idea when n = 2, see Figure 11.1. 






Figure 11.1 Deák's Method in Two Dimensions (the figure marks the points $r_1(v)Lv$ and $r_2(v)Lv$)


Sampling is then performed on $v$ only, while the one-dimensional integral
$\int_{r_1(v)}^{r_2(v)} d\chi_n(\lambda)$ is calculated explicitly.
This gives $\hat p = \frac{1}{N} \sum_{j=1}^{N} p(v^j)$, where $p(v^j) = \int_{r_1(v^j)}^{r_2(v^j)} d\chi_n(\lambda)$. Again one could
use simple Monte-Carlo sampling from the uniform distribution on $S^n$, or the
method of antithetic variables for some reduction in the variance. Deák presents
a different idea which seems to work quite well in practice.
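
The radial integration along a sampled direction can be illustrated with the sketch below. It uses plain independent directions (not Deák's orthonormal variates), and it evaluates the chi distribution function through the standard series for the regularized incomplete gamma function, which is an implementation choice made here and not taken from the text. All function names are hypothetical.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular L with L L' = sigma."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chi_cdf(r, n):
    """P(chi_n <= r) via the series of the regularized incomplete gamma function."""
    if r <= 0.0:
        return 0.0
    a, x = 0.5 * n, 0.5 * r * r
    if x > 200.0 + 10.0 * a:
        return 1.0  # indistinguishable from 1 in double precision for moderate n
    term = math.exp(-x + a * math.log(x) - math.lgamma(a + 1.0))
    total = term
    k = 1
    while term > 1e-17 * total:
        term *= x / (a + k)
        total += term
        k += 1
    return min(total, 1.0)


def ray_interval(Lv, lower, upper):
    """Return (r1, r2) such that a <= r * Lv <= b for r1 <= r <= r2, r >= 0, or None."""
    r1, r2 = 0.0, math.inf
    for w, a, b in zip(Lv, lower, upper):
        if w > 0.0:
            lo, hi = a / w, b / w
        elif w < 0.0:
            lo, hi = b / w, a / w
        elif a <= 0.0 <= b:
            continue  # this constraint holds along the whole ray
        else:
            return None
        r1, r2 = max(r1, lo), min(r2, hi)
    return (r1, r2) if r1 < r2 else None


def deak_estimate(sigma, lower, upper, n_directions, seed=0):
    """Average of p(v) over independent uniform directions v on the unit sphere."""
    rng = random.Random(seed)
    n = len(sigma)
    L = cholesky(sigma)
    total = 0.0
    for _ in range(n_directions):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
        Lv = [sum(L[i][k] * v[k] for k in range(i + 1)) for i in range(n)]
        interval = ray_interval(Lv, lower, upper)
        if interval is not None:
            total += chi_cdf(interval[1], n) - chi_cdf(interval[0], n)
    return total / n_directions
```

Because each direction contributes an exactly integrated probability instead of a 0/1 indicator, the per-sample variance is typically much smaller than for simple Monte-Carlo.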

Instead of an independent sample of $v^j$, Deák advocates the use of
"orthonormal variates", that is, starting from a random system $\{v^j_1, \ldots, v^j_n\}$ of
orthonormal directions one forms all linear combinations $d^j = \frac{1}{\sqrt{k}} \sum_{\ell=1}^{k} (-1)^{\eta_\ell} v^j_{i_\ell}$,
where $\eta_\ell = 0$ or $1$ for all $\ell$ and $i_{\ell_1} \ne i_{\ell_2}$ if $\ell_1 \ne \ell_2$.

This method has the advantage of creating a large number of random
points, namely $2^k \binom{n}{k}$, with comparatively little effort. The larger sample size has
a dramatic effect on the variance of the estimator $\hat p$, and Deák reports a
"coefficient of efficiency" of up to 1000 [4]. Efficiency is measured as $\sigma_0^2 t_0 / \sigma_1^2 t_1$, where
$\sigma_0^2$ is the approximate variance of the simple Monte-Carlo estimator for a fixed
$N$, $\sigma_1^2$ is the approximate variance of Deák's estimator for the same $N$, forming
linear combinations of $k$ directions, and $t_0$ and $t_1$ are the respective computation
times. This particular measure is used because for both methods the variance
is roughly proportional to $1/N$, that is, $\sigma_i^2 t_i$ is approximately constant. The
parameter $k$ can in principle be chosen arbitrarily from the set $\{1, 2, \ldots, n\}$, but
the maximum value of $2^k \binom{n}{k}$ occurs when $k = \lfloor (2n+1)/3 \rfloor$, and the computational
complexity increases very fast. Deák reports best results for $k = 2, 3, 4$.
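
The construction of orthonormal variates can be made concrete with a short sketch. The random orthonormal system is obtained here by Gram-Schmidt orthogonalization of Gaussian vectors, which is one common way to do it and an assumption of this example; the function names are hypothetical.

```python
import itertools
import math
import random


def random_orthonormal_system(n, rng):
    """Gram-Schmidt applied to Gaussian vectors gives n random orthonormal directions."""
    basis = []
    while len(basis) < n:
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis


def orthonormal_variates(basis, k):
    """All directions (1/sqrt(k)) * sum of +/- k distinct basis vectors."""
    n = len(basis)
    scale = 1.0 / math.sqrt(k)
    directions = []
    for idx in itertools.combinations(range(n), k):
        for signs in itertools.product((1.0, -1.0), repeat=k):
            d = [scale * sum(s * basis[i][c] for s, i in zip(signs, idx))
                 for c in range(n)]
            directions.append(d)
    return directions
```

For `n = 4` and `k = 2` this produces $2^2 \binom{4}{2} = 24$ unit directions from a single orthonormal system.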

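To adapt the method for computing $q_k$ one observes that

\[
\begin{aligned}
\int_A x_k f(x)\, dx &= \int_{S^n} \int_{r_1(v)}^{r_2(v)} \lambda L_k v\, d\chi_n(\lambda)\, dU(v) \\
&= \int_{S^n} L_k v \int_{r_1(v)}^{r_2(v)} \lambda\, d\chi_n(\lambda)\, dU(v) \\
&= \beta \int_{S^n} L_k v \int_{r_1(v)}^{r_2(v)} d\chi_{n+1}(\lambda)\, dU(v),
\end{aligned}
\tag{11.5}
\]

where $L_k$ denotes the $k$-th row of $L$ and $\beta = \sqrt{2}\, \Gamma\!\left(\frac{n+1}{2}\right) / \Gamma\!\left(\frac{n}{2}\right)$.
This results in the estimator $\hat q_k = \frac{\beta}{N} \sum_{j=1}^{N} L_k v^j\, \tilde p(v^j)$, with $\tilde p(v^j)$ computed in analogy
with $p(v^j)$ above, using $\chi_{n+1}$ in place of $\chi_n$. It should be clear that $\hat q_k$ and $\hat p$ can all be
computed simultaneously from the sample, and the $r_\ell(v)$ need to be determined
just once.


11.4 Szántai's Procedure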
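Szántai [22] uses a completely different approach, based on a Bonferroni-type
decomposition of the sample space $R^n$. It is not hard to see that

\[
\begin{aligned}
\int_A \phi\, dF = {} & \int_{R^n} \phi\, dF - \sum_i \int_{\bar A_i} \phi\, dF + \sum_{i<j} \int_{\bar A_i \cap \bar A_j} \phi\, dF \\
& - \sum_{i<j<k} \int_{\bar A_i \cap \bar A_j \cap \bar A_k} \phi\, dF + \cdots + (-1)^n \int_{\bar A_1 \cap \cdots \cap \bar A_n} \phi\, dF,
\end{aligned}
\tag{11.6}
\]

where $\bar A_i = \{x : a_i > x_i \text{ or } b_i < x_i\}$ and $\phi$ is an arbitrary integrand. Using
the fact that $\int_{\bar A_i} dF$ and $\int_{\bar A_i \cap \bar A_j} dF$ can be calculated easily (we are in the case
$n = 1$ or $n = 2$), and using simple Monte-Carlo simulation, one has available
three different unbiased estimates for $p$, namely

(i) $\hat p^{(1)}$ - sampling directly from $\int_A dF$,
(ii) $\hat p^{(2)}$ - calculating explicitly $\int_{R^n} dF - \sum_i \int_{\bar A_i} dF = \sum_i \int_{A_i} dF_i + 1 - n$ and
sampling from the rest,
(iii) $\hat p^{(3)}$ - calculating $\int_{R^n} dF - \sum_i \int_{\bar A_i} dF + \sum_{i<j} \int_{\bar A_i \cap \bar A_j} dF$
$= \sum_{i<j} \int_{A_i \cap A_j} dF_{ij} + (2 - n) \sum_i \int_{A_i} dF_i + \frac{(n-1)(n-2)}{2}$
and sampling from the rest.

Here $A_i = \{x : a_i \le x_i \le b_i\}$, and $F_i$, $F_{ij}$ denote the one- and two-dimensional
marginal distributions.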
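Szántai describes a way to condense the tail of the expansion in (11.6) into a
single expression which involves only $\nu(j)$, the number of constraints $a_i \le x^j_i \le b_i$
which are violated by a given sample point $x^j$. In fact, one obtains

\[
\sum_{i<j<k} \int_{\bar A_i \cap \bar A_j \cap \bar A_k} dF - \cdots + (-1)^{n+1} \int_{\bar A_1 \cap \cdots \cap \bar A_n} dF
\;\approx\; \frac{1}{N} \sum_{j=1}^{N} \frac{\max\{0, \nu(j) - 1\}\, (\nu(j) - 2)}{2}.
\tag{11.7}
\]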


Finally, the covariance structure of the estimators is determined to form
yet another unbiased estimator $\hat p^{(4)} = \lambda_1 \hat p^{(1)} + \lambda_2 \hat p^{(2)} + (1 - \lambda_1 - \lambda_2)\, \hat p^{(3)}$, where
the weights $\lambda_1$ and $\lambda_2$ are chosen so as to minimize the variance of $\hat p^{(4)}$. (This
minimization can be carried out analytically.) Szántai reports improvements in
efficiency which are of the same order of magnitude as those in [4].
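
The first two estimators can be sketched as follows, assuming a correlation matrix (unit variances), so that the one-dimensional marginal probabilities are available from the standard normal distribution function. The sketch omits $\hat p^{(3)}$, which needs a bivariate normal routine, and the variance-minimizing combination; the function names are hypothetical. The sampled remainder for $\hat p^{(2)}$ is $\max\{0, \nu - 1\}$, where $\nu$ is the number of violated constraints.

```python
import math
import random


def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def cholesky(sigma):
    """Lower-triangular L with L L' = sigma."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sigma[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def szantai_p1_p2(sigma, lower, upper, n_samples, seed=0):
    """Return the estimates (p1, p2) of P(a <= x <= b) for x ~ N(0, sigma)."""
    rng = random.Random(seed)
    n = len(sigma)
    L = cholesky(sigma)
    # Explicit part of p2: sum of marginal interval probabilities + 1 - n.
    explicit = sum(std_normal_cdf(b) - std_normal_cdf(a)
                   for a, b in zip(lower, upper)) + 1 - n
    hits = 0
    tail = 0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x = [sum(L[i][k] * w[k] for k in range(i + 1)) for i in range(n)]
        nu = sum(1 for lo, c, up in zip(lower, x, upper) if not lo <= c <= up)
        hits += 1 if nu == 0 else 0
        tail += max(0, nu - 1)  # sampled remainder of the expansion
    return hits / n_samples, explicit + tail / n_samples
```

Both values estimate the same probability; their sample covariance would be the input to the least-variance combination described above.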

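In order to adapt Szántai's method for calculation of the $q_k$, it is necessary
to evaluate expressions of the form $\int_{\bar A_i} x_k\, dF$, $\int_{\bar A_i \cap \bar A_j} x_k\, dF$. This can be done
by a procedure quite similar to the method presented earlier for the case $n = 2$,
namely by completing the square in the exponent of the integrand. This yields
the formula

\[
\begin{aligned}
\int_{A_i \cap A_j} x_i\, dF = \frac{1}{\sqrt{2\pi}} \Bigg(
  & e^{-a_i^2/2} \left[ \Phi\!\left( \frac{b_j - \sigma_{ij} a_i}{\sqrt{1 - \sigma_{ij}^2}} \right) - \Phi\!\left( \frac{a_j - \sigma_{ij} a_i}{\sqrt{1 - \sigma_{ij}^2}} \right) \right] \\
- {} & e^{-b_i^2/2} \left[ \Phi\!\left( \frac{b_j - \sigma_{ij} b_i}{\sqrt{1 - \sigma_{ij}^2}} \right) - \Phi\!\left( \frac{a_j - \sigma_{ij} b_i}{\sqrt{1 - \sigma_{ij}^2}} \right) \right] \\
+ {} & \sigma_{ij}\, e^{-a_j^2/2} \left[ \Phi\!\left( \frac{b_i - \sigma_{ij} a_j}{\sqrt{1 - \sigma_{ij}^2}} \right) - \Phi\!\left( \frac{a_i - \sigma_{ij} a_j}{\sqrt{1 - \sigma_{ij}^2}} \right) \right] \\
- {} & \sigma_{ij}\, e^{-b_j^2/2} \left[ \Phi\!\left( \frac{b_i - \sigma_{ij} b_j}{\sqrt{1 - \sigma_{ij}^2}} \right) - \Phi\!\left( \frac{a_i - \sigma_{ij} b_j}{\sqrt{1 - \sigma_{ij}^2}} \right) \right] \Bigg).
\end{aligned}
\tag{11.8}
\]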


In the same manner as before one arrives at three unbiased estimators for
$q_k$, namely

(i) sampling directly from $\int_A x_k\, dF$,
(ii) calculating explicitly $\int_{R^n} x_k\, dF - \sum_i \int_{\bar A_i} x_k\, dF = \sum_i \int_{A_i} x_k\, dF$ and
sampling from the rest,
(iii) calculating explicitly $\int_{R^n} x_k\, dF - \sum_i \int_{\bar A_i} x_k\, dF + \sum_{i<j} \int_{\bar A_i \cap \bar A_j} x_k\, dF$
$= \sum_{i<j} \int_{A_i \cap A_j} x_k\, dF + (2 - n) \sum_i \int_{A_i} x_k\, dF$ and sampling from the rest.


Once more the affine combination of least variance is formed. Numerical results
appear in Section 11.6.


11.5 A Hybrid Method


Deák's method uses a decomposition of the random variable, while Szántai's
procedure can be thought of as a decomposition of the sample space $R^n$. Both
authors report impressive reductions in computing time, and it seems natural
to try to combine the two approaches, exploiting all the desirable features
simultaneously. The derivation proceeds exactly as in the previous section, but
now Deák's decomposition is to be used in sampling $\hat p^{(1)}, \hat p^{(2)}, \hat p^{(3)}$ instead of
simple Monte-Carlo. Unfortunately this means that Formula (11.7) is no longer
available. Instead one writes

\[
\begin{aligned}
\sum_{i<j} \int_{\bar A_i \cap \bar A_j} dF & - \sum_{i<j<k} \int_{\bar A_i \cap \bar A_j \cap \bar A_k} dF + \cdots + (-1)^n \int_{\bar A_1 \cap \cdots \cap \bar A_n} dF \\
&= \sum_{\ell=2}^{n} (\ell - 1) \int_{B_\ell} dF(x) = \sum_{\ell=2}^{n} (\ell - 1) \int_{S^n} \int_{C_\ell(v)} d\chi_n(\lambda)\, dU(v) \\
&= \int_{S^n} \sum_{\ell=2}^{n} (\ell - 1) \int_{C_\ell(v)} d\chi_n(\lambda)\, dU(v),
\end{aligned}
\tag{11.9}
\]

where $B_\ell = \{x : a_i \le x_i \le b_i \text{ is violated for exactly } \ell \text{ indices } i\}$,
      $C_\ell(v) = \{\lambda : a_i \le \lambda L_i v \le b_i \text{ is violated for exactly } \ell \text{ indices } i\}$.


In a completely analogous fashion one obtains

\[
\sum_{i<j<k} \int_{\bar A_i \cap \bar A_j \cap \bar A_k} dF - \cdots + (-1)^{n+1} \int_{\bar A_1 \cap \cdots \cap \bar A_n} dF
= \int_{S^n} \sum_{\ell=3}^{n} \frac{(\ell - 1)(\ell - 2)}{2} \int_{C_\ell(v)} d\chi_n(\lambda)\, dU(v).
\tag{11.10}
\]

In order to estimate the quantities in (11.9) and (11.10), a sample is generated
from the uniform distribution on $S^n$. Since the event $L_i v = 0$ has probability
zero, its occurrence can be ignored, which makes it possible to determine $C_\ell(v)$
in the following way.

Let critical values $\ell_i, u_i$ be defined by $\ell_i = \min\{a_i / L_i v,\ b_i / L_i v\}$, $u_i =
\max\{a_i / L_i v,\ b_i / L_i v\}$ and assume without loss of generality that the vectors
$\ell, u$ are arranged in ascending order. Further set $\ell_0 = u_0 = -\infty$, $\ell_{n+1} =
u_{n+1} = +\infty$. If $\lambda, i, j$ are such that $\ell_{i-1} < \lambda < \ell_i$ and $u_{j-1} < \lambda < u_j$,
then it is not hard to see that exactly $n + j - i$ inequalities are violated, i.e.
$\lambda \in C_{n+j-i}(v)$. Moreover, as $\lambda$ decreases, the number of inequalities violated
increases or decreases by 1, whenever the next critical value encountered is from
the vector $\ell$ or $u$, respectively; none of the inequalities hold if $\lambda > u_n$. Finally
$\lambda L v \in A$ if and only if all inequalities hold, i.e. if $\ell_n < \lambda < u_1$.


This suggests the following algorithmic procedure:

Step 0: (Initialize)
    Determine $\ell, u$.
    Set nviol ← n, nhigh ← n, nlow ← n.
    Set upper ← $+\infty$, $\nu_1$ ← 0, $\nu_2$ ← 0, $\nu_3$ ← 0.

Step 1: (Process the next interval)
    Set lower ← $\max\{u_{nhigh}, \ell_{nlow}\}$, $\alpha$ ← $\int_{lower}^{upper} d\chi_n(\lambda)$.
    Set $z_2$ ← nviol − 1, $z_3$ ← $z_2$ (nviol − 2) / 2.
    If nviol = 0, set $\nu_1$ ← $\nu_1 + \alpha$.
    If nviol ≥ 2, set $\nu_2$ ← $\nu_2 + z_2 \alpha$, $\nu_3$ ← $\nu_3 - z_3 \alpha$.

Step 2: (Update)
    If $u_{nhigh} > \ell_{nlow}$, set nviol ← nviol − 1, nhigh ← nhigh − 1.
    If $u_{nhigh} = \ell_{nlow}$, set nhigh ← nhigh − 1, nlow ← nlow − 1.
    If $u_{nhigh} < \ell_{nlow}$, set nviol ← nviol + 1, nlow ← nlow − 1.
    Set upper ← lower.
    If upper > 0, go to 1.

Step 3: (Generate another sample point and repeat.)
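
The bookkeeping of the procedure above can be illustrated with a simplified sketch. Instead of the two-pointer sweep over the sorted vectors $\ell$ and $u$, it collects all positive critical values, sorts them once, and counts the violated constraints at a test point inside each interval; this substitution is made here for readability and is not the procedure as stated. The function names and the `chi_cdf` callback are hypothetical, and all components of `Lv` are assumed non-zero.

```python
import math


def violation_intervals(Lv, lower, upper):
    """Split lambda >= 0 into intervals with a constant number of violated constraints.

    Returns a list of (lo, hi, n_violated) triples in ascending order.
    """
    crit = {0.0, math.inf}
    bounds = []
    for w, a, b in zip(Lv, lower, upper):
        lo, hi = sorted((a / w, b / w))  # critical values l_i <= u_i
        bounds.append((lo, hi))
        crit.update(c for c in (lo, hi) if c > 0.0)
    pts = sorted(crit)
    result = []
    for left, right in zip(pts, pts[1:]):
        mid = left + 1.0 if math.isinf(right) else 0.5 * (left + right)
        nviol = sum(1 for lo, hi in bounds if not lo <= mid <= hi)
        result.append((left, right, nviol))
    return result


def ray_contributions(intervals, chi_cdf):
    """Accumulate nu_1, nu_2, nu_3 as in Steps 1 and 2 for one direction."""
    nu1 = nu2 = nu3 = 0.0
    for lo, hi, nviol in intervals:
        alpha = chi_cdf(hi) - chi_cdf(lo)
        if nviol == 0:
            nu1 += alpha
        if nviol >= 2:
            z2 = nviol - 1
            z3 = z2 * (nviol - 2) / 2
            nu2 += z2 * alpha
            nu3 -= z3 * alpha
    return nu1, nu2, nu3
```

For $n = 2$ the chi distribution function is $1 - e^{-r^2/2}$, so the sketch can be tried with `lambda r: 1.0 - math.exp(-0.5 * r * r)` as the callback.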


For given sample size N this defines three estimators 


where $\tilde\nu$ is obtained by the same algorithm using $\tilde\alpha = \beta \int_{lower}^{upper} d\chi_{n+1}(\lambda)$ in place
of $\alpha$. The constant $\beta$ is as in (11.5). We explicitly remark that the critical values
$\ell, u$ have to be calculated just once for each sample point.

From these triples of estimators one can then form Szántai's estimator
$\hat q_k^{(4)} = \lambda^k_1 \hat q_k^{(1)} + \lambda^k_2 \hat q_k^{(2)} + \lambda^k_3 \hat q_k^{(3)}$, which is the affine combination of least variance.
(The weights $\lambda^k_i$ may differ from component to component.)


11.6 Numerical Results


Table 11.1 gives an overview of the performance of the various methods
of Sections 11.2-11.5 when applied to a small number of low-dimensional problems.
The abbreviations used in the table are


SMC - Simple Monte Carlo
AV  - Antithetic Variables
DMC - Simple Monte Carlo applied to decomposition (11.4)
DAV - Antithetic Variables applied to decomposition (11.4)
Dek - Deák's method, using k vectors at a time
Sz  - Szántai's method
Hyk - Hybrid method, using k vectors at a time.


For each method we report a weighted efficiency rating for the estimator p, 
namely the quantity (0? 9¢100 + O300¢200 + O2oots00 + 7 fo00t1000)"10°, where 
$\sigma^2_N$ is the variance and $t_N$ is the CPU time for sample size $N$. Results for the
$\hat q_k$ were quite similar and had to be omitted to save space. All computations
were performed on the Amdahl 460 V/8 computer at the University of British
Columbia Computing Centre.


As expected, simple Monte-Carlo is worst for all four problems, while the
hybrid method is characterized by extremely slow computing times. This may
be due in part to the sorting algorithm which was used in determining the
vectors $\ell$ and $u$ of critical values. Up to 15% of total CPU time was spent in the
sorting routine VSRTA of the IMSL library which uses a quicksort algorithm. A
self-contained merge-insertion algorithm [14] might do better in some instances.
Moreover, the one-dimensional integration could be accelerated greatly by
computing only a table of reference values and interpolating whenever a new value
is needed.


Szántai's procedure is comparable in performance to Deák's method, except
on problem 4 where it is markedly inferior. It is interesting to note that problem
4 is also the problem with the smallest conditional probability. On problem 5,
all 1000 sample points generated for Szántai's method lie outside the region of
interest. Thus the sample variance is zero, although the estimate is inaccurate.
The hybrid method clearly outperforms all competitors on problems 3 and 5
which have the highest conditional probability.


Further testing is clearly indicated, but at this stage it seems best to use
Deák's method whenever the conditional probability is expected to be small
and the hybrid method if the conditional probability is expected to be large.
The best value for the parameter $k$ in each case seems to be close to $n/2$.


Table 11.1. Comparison of Efficiency

Problem No.     1        2        3        4        5        6
Dimension       3        3        4        5        10       10
Probability     0.028    0.20     0.91     0.0064   0.97     0.037

SMC

*No random point was generated within the region of interest.


11.7 Conditional Expectation 


Up to now the emphasis has been on determining "good" estimators $\hat q_k$ and $\hat p$ for
the numerator and denominator of Formula (11.3). The real interest, however,
lies in the ratio $q_k / p$, so it is natural to ask about statistical properties of the
quotient $\hat q_k / \hat p$ for the various estimators, in particular about its mean and its
variance.

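The first unpleasant surprise is the fact that the quotients are not unbiased
[16]. Using a Taylor expansion about the point $q_k / p$, it is not hard to verify
that

\[ E\!\left( \frac{\hat q_k}{\hat p} \right) = \frac{q_k}{p} + \frac{1}{p^2} \left( \frac{q_k}{p}\, \sigma_{00} - \sigma_{k0} \right) + O(N^{-2}), \tag{11.11} \]

where $\sigma_{00} = \operatorname{var}(\hat p)$, $\sigma_{k0} = \operatorname{cov}(\hat q_k, \hat p)$. The bias can be reduced (but not
eliminated) by forming the estimators

\[ \tilde\xi_k = \frac{\hat q_k}{\hat p} - \frac{1}{\hat p^2} \left[ \frac{\hat q_k}{\hat p}\, \widehat{\operatorname{Var}}(\hat p) - \widehat{\operatorname{Cov}}(\hat q_k, \hat p) \right]. \tag{11.12} \]

Expanding this expression into a Taylor series shows that $E(\tilde\xi_k) = q_k / p +
O(N^{-2})$.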
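
A minimal sketch of the corrected quotient (11.12) follows. It assumes that both estimators are plain averages of per-sample contributions, so that the variance and covariance of the averages can be estimated from the sample; the function name and argument layout are hypothetical.

```python
def corrected_ratio(p_terms, q_terms):
    """Bias-corrected quotient q_hat / p_hat in the sense of (11.12).

    p_terms[j] and q_terms[j] are the contributions of sample point j,
    so that p_hat and q_hat are their respective averages.
    """
    n = len(p_terms)
    p_hat = sum(p_terms) / n
    q_hat = sum(q_terms) / n
    # Sample estimates of Var(p_hat) and Cov(q_hat, p_hat); both refer to
    # the averages, hence the extra division by n.
    var_p = sum((p - p_hat) ** 2 for p in p_terms) / (n * (n - 1))
    cov_qp = sum((q - q_hat) * (p - p_hat)
                 for p, q in zip(p_terms, q_terms)) / (n * (n - 1))
    ratio = q_hat / p_hat
    return ratio - (ratio * var_p - cov_qp) / p_hat ** 2
```

For simple Monte-Carlo the contributions are $1_A(x^j)$ and $x^j_k 1_A(x^j)$; for Deák's method they would be the per-direction integrals.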

It should be noted that in formula (11.11) the true variance and covariance
are used, while the quantities appearing in (11.12) are their estimates obtained
from the sample. Further improvements may be effected by retaining higher
order terms in the Taylor expansion as well as higher order sample moments.
The possible gain is not easily assessed, however, and storage and computation
requirements would be increased considerably.

Approximate formulas for the variance of the different estimators can also
be obtained by Taylor expansion, namely

\[ \operatorname{Var}(\tilde\xi_k) = \frac{1}{p^2} \left[ \sigma_{kk} - 2\, \frac{q_k}{p}\, \sigma_{k0} + \left( \frac{q_k}{p} \right)^{2} \sigma_{00} \right] + O(N^{-2}), \]

where $\sigma_{kk} = \operatorname{var}(\hat q_k)$.
This variance inflation poses a serious problem, in particular if $p$ is small, and
cannot be eliminated even if $p$ is known with certainty. On the other hand,
rectangles of little mass contribute little to the overall bounds in formula (11.1),
and perhaps the accuracy requirements may be relaxed a little in this case.


11.8 Extensions 


It may sometimes be desirable to find conditional information on polyhedral
sets other than n-dimensional rectangles. For instance, a decomposition of the
sample space into (possibly unbounded) simplices improves the computation of
the bounds in (11.1) and may be preferred in some situations.

Let therefore a (non-empty) polyhedral set $A := \{x : a \le Tx \le b\}$ be given,
where some or all components of the vector $a$ may be set to $-\infty$; similarly $b$
may contain certain components equal to $+\infty$. We will assume that the rows of
$T$ are pairwise linearly independent and again seek to determine the quantities
$p = \int_A dF$, $q_k = \int_A x_k\, dF$, where $x \sim N(0, \Sigma)$. This problem is very similar to
the previous problem of n-dimensional rectangles, since the quantity $y = Tx$
is normally distributed with mean 0 and covariance matrix $\tilde\Sigma = T \Sigma T'$, which
may be singular if the number of constraints exceeds $n$. Thus $\tilde\Sigma$ may not
have a Choleski decomposition; instead the Choleski decomposition of $\Sigma$ should
be used to form the matrix $\tilde L = TL$. From then on it is smooth sailing: all the
techniques and methods of Sections 11.2-11.5 will go through provided $L$ is replaced
by $\tilde L$ in all formulas. Sampling should still be done from the n-dimensional
normal distribution and the uniform distribution on $S^n$ where appropriate.

Other multivariate distributions for which conditional information may be 
of interest include the multilognormal [13] and multigamma [18] distributions. 

A random vector $x = (x_1, \ldots, x_n)$ has a multilognormal distribution if and
only if the random vector $\ln x = (\ln x_1, \ldots, \ln x_n)$ has a multinormal
distribution. Hence it is possible to work with the vector $\ln x$ instead, and all the
previous results can be used.

Another interesting distribution is the multigamma distribution which has
seen some application in chance-constrained stochastic programming problems
[17]. By a suitable scaling transformation an arbitrary univariate gamma
distribution can be reduced to standard form which is defined by the density

\[ f_\theta(z) = \begin{cases} \dfrac{1}{\Gamma(\theta)}\, z^{\theta - 1} e^{-z} & \text{if } z > 0, \\[1ex] 0 & \text{if } z \le 0. \end{cases} \]

Hence we can assume without loss of generality that the multivariate gamma
distribution is such that all the one-dimensional marginals have standard form.
We shall write $z \sim \Gamma(\theta)$ if $z$ has a standard gamma distribution with parameter
$\theta$. A well-known fact about the standard gamma distribution is its additivity;
more precisely, if $z_1 \sim \Gamma(\theta_1)$, $z_2 \sim \Gamma(\theta_2)$, and $z_1$ is independent of $z_2$, then
$z_1 + z_2 \sim \Gamma(\theta_1 + \theta_2)$.

Deák's method no longer applies to the computation of the conditional
probabilities, since the decomposition of formula (11.4) is not available any
more, but Szántai's method can still be used, and it is easily adapted for
computation of the $q_k$ as well. To that end one needs to develop algorithms for the
evaluation of univariate and bivariate gamma distributions in formula (11.6).
Once this has been done, the quantities $p, q_k$ can be estimated by a simple
Monte-Carlo method in three different ways and the combination of least
variance can be found.

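The univariate case is easy: there are efficient library routines available to
compute

\[ F_\theta(a) = \int_0^a f_\theta(z)\, dz, \]

and for the conditional expectation one notes that

\[ \int_0^a z\, \frac{z^{\theta - 1} e^{-z}}{\Gamma(\theta)}\, dz = \int_0^a \frac{z^{\theta} e^{-z}}{\Gamma(\theta)}\, dz = \theta \int_0^a \frac{z^{(\theta + 1) - 1} e^{-z}}{\Gamma(\theta + 1)}\, dz. \]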
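
The last identity says that the partial mean of a standard gamma variable over $[0, a]$ equals $\theta$ times the distribution function with parameter $\theta + 1$ evaluated at $a$. A minimal sketch follows; the distribution function is evaluated with the standard series for the regularized incomplete gamma function, which is an implementation choice of this example (a library routine would normally be used), and the function names are hypothetical.

```python
import math


def gamma_cdf(a, theta):
    """Distribution function of the standard gamma distribution with parameter theta."""
    if a <= 0.0:
        return 0.0
    if a > 200.0 + 10.0 * theta:
        return 1.0  # indistinguishable from 1 in double precision for moderate theta
    term = math.exp(-a + theta * math.log(a) - math.lgamma(theta + 1.0))
    total = term
    k = 1
    while term > 1e-17 * total:
        term *= a / (theta + k)
        total += term
        k += 1
    return min(total, 1.0)


def gamma_partial_mean(theta, a):
    """Integral of z * f_theta(z) over [0, a], using z f_theta(z) = theta f_{theta+1}(z)."""
    return theta * gamma_cdf(a, theta + 1.0)
```

Dividing `gamma_partial_mean(theta, a)` by `gamma_cdf(a, theta)` gives the conditional mean of the variable given that it does not exceed `a`.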


For the bivariate distribution $F(a_1, a_2) = \int_0^{a_1} \int_0^{a_2} f(x_1, x_2)\, dx_2\, dx_1$, Szántai
uses the decomposition of $x_1, x_2$ into three independent standard gamma
distributions $z_1, z_2, z_3$ with suitable parameters $\theta_1, \theta_2, \theta_3$, respectively, in the form

\[
\begin{aligned}
x_1 &= z_1 + z_2 \\
x_2 &= z_1 + z_3
\end{aligned}
\]

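and obtains the series expansion

\[
f(x_1, x_2) = f_{\theta_1 + \theta_2}(x_1)\, f_{\theta_1 + \theta_3}(x_2) \left\{ 1 + \sum_{r=1}^{\infty} \frac{r!\, \Gamma(\theta_1 + r)\, \Gamma(\theta_1 + \theta_2)\, \Gamma(\theta_1 + \theta_3)}{\Gamma(\theta_1)\, \Gamma(\theta_1 + \theta_2 + r)\, \Gamma(\theta_1 + \theta_3 + r)}\, L_r^{\theta_1 + \theta_2 - 1}(x_1)\, L_r^{\theta_1 + \theta_3 - 1}(x_2) \right\},
\]

\[
F(a_1, a_2) = F_{\theta_1 + \theta_2}(a_1)\, F_{\theta_1 + \theta_3}(a_2) + \sum_{r=1}^{\infty} C(\theta_1, \theta_2, \theta_3, r)\, f_{\theta_1 + \theta_2 + 1}(a_1)\, f_{\theta_1 + \theta_3 + 1}(a_2)\, L_{r-1}^{\theta_1 + \theta_2}(a_1)\, L_{r-1}^{\theta_1 + \theta_3}(a_2),
\]

where $F_\theta$ is the standard gamma distribution with parameter $\theta$,

\[
C(\theta_1, \theta_2, \theta_3, r) = \frac{(r-1)!\; (\theta_1 + r - 1)(\theta_1 + r - 2) \cdots \theta_1}{r\, (\theta_1 + \theta_2 + r - 1) \cdots (\theta_1 + \theta_2 + 1)\, (\theta_1 + \theta_3 + r - 1) \cdots (\theta_1 + \theta_3 + 1)},
\]

and $L_r^\theta(\cdot)$ are the Laguerre polynomials [19] defined by $L_0^\theta(x) = 1$, $L_1^\theta(x) = \theta + 1 - x$
and the recursion

\[ (r + 1)\, L_{r+1}^\theta(x) - (2r + \theta + 1 - x)\, L_r^\theta(x) + (r + \theta)\, L_{r-1}^\theta(x) = 0, \qquad r = 1, 2, 3, \ldots \]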
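
The three-term recursion translates directly into a short routine. The sketch below is illustrative only; the function name `laguerre` is hypothetical.

```python
def laguerre(r, theta, x):
    """Generalized Laguerre polynomial L_r^theta(x) via the three-term recursion."""
    previous, current = 1.0, theta + 1.0 - x  # L_0 and L_1
    if r == 0:
        return previous
    for m in range(1, r):
        # (m + 1) L_{m+1} = (2m + theta + 1 - x) L_m - (m + theta) L_{m-1}
        previous, current = current, (
            (2 * m + theta + 1.0 - x) * current - (m + theta) * previous
        ) / (m + 1)
    return current
```

As a check, the closed form $L_2^\theta(x) = x^2/2 - (\theta + 2)x + (\theta + 1)(\theta + 2)/2$ gives $-0.625$ for $\theta = 1.5$ and $x = 2$, which the recursion reproduces.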


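A conditioning argument can be used to derive expressions for

\[ \int_0^{a_1} \int_0^{a_2} x_1 f(x_1, x_2)\, dx_2\, dx_1, \qquad \int_0^{a_1} \int_0^{a_2} x_2 f(x_1, x_2)\, dx_2\, dx_1, \]

based on the fact that the random variables $x_1$ and $x_2$ are conditionally
independent given the value of $z_1$. Therefore one obtains

\[
\begin{aligned}
\int_0^{a_1} \int_0^{a_2} x_1 f(x_1, x_2)\, dx_2\, dx_1
&= \int_0^{a_1} \int_0^{a_2} x_1 \left[ \int_0^{\infty} f(x_1, x_2 \mid z_1)\, f_{\theta_1}(z_1)\, dz_1 \right] dx_2\, dx_1 \\
&= \int_0^{\infty} \int_0^{a_1} \int_0^{a_2} x_1 f(x_1, x_2 \mid z_1)\, f_{\theta_1}(z_1)\, dx_2\, dx_1\, dz_1 \\
&= \int_0^{a_1 \wedge a_2} \int_{z_1}^{a_1} \int_{z_1}^{a_2} \bigl( z_1 + (x_1 - z_1) \bigr)\, f_{\theta_2}(x_1 - z_1)\, f_{\theta_3}(x_2 - z_1)\, f_{\theta_1}(z_1)\, dx_2\, dx_1\, dz_1 \\
&= \int_0^{a_1 \wedge a_2} \int_0^{a_1 - z_1} \int_0^{a_2 - z_1} (z_1 + z_2)\, f_{\theta_2}(z_2)\, f_{\theta_3}(z_3)\, f_{\theta_1}(z_1)\, dz_3\, dz_2\, dz_1 \\
&= \int_0^{a_1 \wedge a_2} \int_0^{a_1 - z_1} \int_0^{a_2 - z_1} z_1 f_{\theta_1}(z_1)\, f_{\theta_2}(z_2)\, f_{\theta_3}(z_3)\, dz_3\, dz_2\, dz_1 \\
&\quad + \int_0^{a_1 \wedge a_2} \int_0^{a_1 - z_1} \int_0^{a_2 - z_1} z_2 f_{\theta_2}(z_2)\, f_{\theta_1}(z_1)\, f_{\theta_3}(z_3)\, dz_3\, dz_2\, dz_1 \\
&= \theta_1 \int_0^{a_1 \wedge a_2} \int_0^{a_1 - z_1} \int_0^{a_2 - z_1} f_{\theta_1 + 1}(z_1)\, f_{\theta_2}(z_2)\, f_{\theta_3}(z_3)\, dz_3\, dz_2\, dz_1 \\
&\quad + \theta_2 \int_0^{a_1 \wedge a_2} \int_0^{a_1 - z_1} \int_0^{a_2 - z_1} f_{\theta_2 + 1}(z_2)\, f_{\theta_1}(z_1)\, f_{\theta_3}(z_3)\, dz_3\, dz_2\, dz_1 \\
&= \theta_1 F'(a_1, a_2) + \theta_2 F''(a_1, a_2),
\end{aligned}
\]

where $F'$ is the distribution of the random vector

\[ x_1' = z_1' + z_2, \qquad x_2' = z_1' + z_3, \qquad z_1' \sim \Gamma(\theta_1 + 1), \]

and $F''$ is the distribution of

\[ x_1'' = z_1 + z_2'', \qquad x_2'' = z_1 + z_3, \qquad z_2'' \sim \Gamma(\theta_2 + 1). \]

Similarly one can derive the formula

\[ \int_0^{a_1} \int_0^{a_2} x_2 f(x_1, x_2)\, dx_2\, dx_1 = \theta_1 F'(a_1, a_2) + \theta_3 F'''(a_1, a_2), \]

where $F'''$ is the distribution of

\[ x_1''' = z_1 + z_2, \qquad x_2''' = z_1 + z_3''', \qquad z_3''' \sim \Gamma(\theta_3 + 1). \]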

The same idea can be used to derive values for expressions of the form

\[ \int_0^{a_1} \int_0^{a_2} \int_0^{\infty} x_k f(x_1, x_2, x_3)\, dx_3\, dx_2\, dx_1, \]

where $f(x_1, x_2, x_3)$ is the joint density of a trivariate gamma distribution which
can be decomposed into independent standard gamma distributions $z_i \sim \Gamma(\theta_i)$
as follows:

\[
\begin{aligned}
x_1 &= z_1 + z_4 + z_5 + z_7 \\
x_2 &= z_2 + z_4 + z_6 + z_7 \\
x_3 &= z_3 + z_5 + z_6 + z_7.
\end{aligned}
\tag{11.13}
\]

Conditioning on the values of $z_4, z_5, z_6, z_7$ and simplifying, one obtains

\[
\int_0^{a_1} \int_0^{a_2} \int_0^{\infty} x_3 f(x_1, x_2, x_3)\, dx_3\, dx_2\, dx_1
= \theta_3 F_3(a_1, a_2, \infty) + \theta_5 F_5(a_1, a_2, \infty) + \theta_6 F_6(a_1, a_2, \infty) + \theta_7 F_7(a_1, a_2, \infty),
\]

where $F_i$ stands for the cumulative trivariate gamma distribution having the
same decomposition structure as (11.13), when $z_i$ is replaced by $z_i' \sim \Gamma(\theta_i + 1)$.

The integration in the $x_3$-direction is over the whole support of the random
variable and can thus be suppressed by working with the two-dimensional
marginal distributions of $x_1, x_2$. After suitably aggregating the components of
$z_i$, this yields the expressions

\[
\begin{aligned}
F_3(a_1, a_2, \infty) &= F(a_1, a_2;\ \theta_4 + \theta_7,\ \theta_1 + \theta_5,\ \theta_2 + \theta_6), \\
F_5(a_1, a_2, \infty) &= F(a_1, a_2;\ \theta_4 + \theta_7,\ \theta_1 + \theta_5 + 1,\ \theta_2 + \theta_6), \\
F_6(a_1, a_2, \infty) &= F(a_1, a_2;\ \theta_4 + \theta_7,\ \theta_1 + \theta_5,\ \theta_2 + \theta_6 + 1), \\
F_7(a_1, a_2, \infty) &= F(a_1, a_2;\ \theta_4 + \theta_7 + 1,\ \theta_1 + \theta_5,\ \theta_2 + \theta_6),
\end{aligned}
\]

where e.g. $F(\cdot, \cdot;\ \theta_4 + \theta_7,\ \theta_1 + \theta_5 + 1,\ \theta_2 + \theta_6)$ is the cumulative distribution of

\[ x_1 = y_1 + y_2, \qquad x_2 = y_1 + y_3, \qquad y_1 \sim \Gamma(\theta_4 + \theta_7),\ \ y_2 \sim \Gamma(\theta_1 + \theta_5 + 1),\ \ y_3 \sim \Gamma(\theta_2 + \theta_6). \]


CHAPTER 12


AN L-SHAPED METHOD COMPUTER CODE FOR 
MULTI-STAGE STOCHASTIC LINEAR PROGRAMS 


J.R. Birge 


Abstract 


A computer code implementing the L-shaped method of Van Slyke and Wets 
is described. The method is generalized to apply to problems with up to three 
periods and up to three hundred seventy-five different future scenarios. The 
main subroutines are described. 


12.1 Introduction 


The L-shaped method for two-stage stochastic linear programs was given by 
Van Slyke and Wets [20], see Chapter 3. It is an outer linearization procedure 
that approximates the convex objective term in the stochastic program by suc- 
cessively appending supporting hyperplanes. This paper describes a multi-stage 
implementation of this algorithm in which the supports are found by optimiz- 
ing a nested sequence of problems. The mechanics of this algorithm and its 
convergence properties are described in Birge [4]. 

The method is a type of nested decomposition procedure that can be com- 
pared with inner linearization procedures such as those of Glassey [9, 10] and 
Ho and Manne [18]. It is also related to basis factorization approaches (Kall 
[14], Strazicky [19], see also the next chapter) and inner linearization of the 
dual (Dantzig and Madansky [6]). 

The basic steps of the algorithm are described in Section 12.2. The main
subroutines of the computer code are then given in Section 12.3. Significant
variables and data structures are also described. Input and output formats are
detailed in Section 12.4 along with examples of their form. Section 12.5 presents
some observations and potential extensions.


12.2 Algorithm Description

The multi-stage stochastic linear program considered by the algorithm is

\[
\begin{aligned}
\text{minimize} \quad & c_1 x_1 + E_{\xi_2}\bigl[ \min c_2 x_2 + \cdots + E_{\xi_T}[ \min c_T x_T ] \cdots \bigr] \\
\text{subject to} \quad & A_1 x_1 = b_1, \\
& -B_1 x_1 + A_2 x_2 = \xi_2, \\
& \qquad \vdots \\
& -B_{T-1} x_{T-1} + A_T x_T = \xi_T, \\
& x_t \ge 0,\ t = 1, \ldots, T, \qquad \xi_t \in \Xi_t,\ t = 2, \ldots, T,
\end{aligned}
\tag{12.1}
\]

where $c_t$ is a known vector in $R^{n_t}$ for $t = 1, \ldots, T$, $b_1$ is a known vector in
$R^{m_1}$, $\xi_t$ is a random $m_t$-vector defined on the probability space $(\Xi_t, \mathcal{F}_t, F_t)$
for $t = 2, \ldots, T$, and $A_t$ and $B_t$ are correspondingly dimensioned known
real-valued matrices. "$E_{\xi_t}$" denotes mathematical expectation with respect to
$\xi_t$.

The L-shaped method of Van Slyke and Wets [20] applies to (12.1) when
$T = 2$. Methods for the multi-stage problem have generally assumed a specific
structure for the problem. Beale, et al. [2] and Ashford [1], for example,
consider a multi-stage production problem and implement an appropriate
approximation. The generalization of the L-shaped method implemented in the
computer code described here and introduced in Birge [3],[4] does not, however,
require any special structure except that the random vectors $\xi_t$ are discretely
distributed.

The algorithm is called the Nested Decomposition for Stochastic Programming
Algorithm (NDSPA). It is based on the observation that given a realization
$\xi_t^j$ of the random vector in period $t$ and given a solution $x_{t-1}^{a(j)}$ from period $t-1$,
the decision problem at period $t$ can be written (see Wets [21])

\begin{align*}
\text{minimize}\quad & c_t x_t^j + Q_{t+1}(x_t^j) \tag{12.2} \\
\text{subject to}\quad & A_t x_t^j = \xi_t^j + B_{t-1} x_{t-1}^{a(j)}, \tag{12.2a} \\
& D_{t,l}^j x_t^j \ge d_{t,l}^j, \quad l = 1,\ldots,r_t^j, \tag{12.2b} \\
& x_t^j \ge 0,
\end{align*}

where $Q_{t+1}(x_t^j)$ is a convex function, $D_{t,l}^j \in \mathbb{R}^{n_t}$ for all $l$, $r_t^j \le m_{t+1}$, and
(12.2b) is a feasibility cut, see Chapter 3.
Program (12.2) can then be solved using a relaxed master problem:


\begin{align*}
\text{minimize}\quad & c_t x_t^j + \theta_t^j \tag{12.3} \\
\text{subject to}\quad & A_t x_t^j = \xi_t^j + B_{t-1} x_{t-1}^{a(j)}, \tag{12.3a} \\
& D_{t,l}^j x_t^j \ge d_{t,l}^j, \quad l = 1,\ldots,r_t^j, \tag{12.3b} \\
& E_{t,l}^j x_t^j + \theta_t^j \ge e_{t,l}^j, \quad l = 1,\ldots,s_t^j, \tag{12.3c} \\
& x_t^j \ge 0. \tag{12.3d}
\end{align*}

Program (12.3) is solved to obtain $(\bar{x}_t^j, \bar{\theta}_t^j)$. If $\bar{\theta}_t^j < Q_{t+1}(\bar{x}_t^j)$, then another
optimality cut (12.3c) is added to (12.3) and (12.3) is re-solved. If $\bar{x}_t^j$ forces
infeasibility in any future period, then a feasibility cut (12.3b) is added to (12.3).
This process is repeated until $\bar{\theta}_t^j \ge Q_{t+1}(\bar{x}_t^j)$. For the construction of feasibility
and optimality cuts, see Chapter 3.
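
The cut loop just described can be illustrated on a deliberately tiny case. The following Python sketch is not taken from NDST3: it assumes a single first-stage variable with bounds, a simple shortage-penalty recourse function, and a master problem that is solved by enumerating cut intersections rather than by the simplex method. The names `recourse`, `solve_master` and `l_shaped_1d` and all data are hypothetical.

```python
def recourse(x, outcomes, q):
    """Return Q(x) = sum_k p_k * q * max(xi_k - x, 0) and one subgradient of Q at x."""
    value = sum(p * q * max(xi - x, 0.0) for xi, p in outcomes)
    slope = -sum(p * q for xi, p in outcomes if xi > x)
    return value, slope


def solve_master(c, cuts, lo, hi):
    """Minimize c*x + theta s.t. E*x + theta >= e for each cut (E, e), lo <= x <= hi.

    With one variable the optimum lies at a bound or where two cuts
    intersect, so enumerating those points is enough.
    """
    candidates = {lo, hi}
    for a in range(len(cuts)):
        for b in range(a + 1, len(cuts)):
            (E1, e1), (E2, e2) = cuts[a], cuts[b]
            if E1 != E2:
                x = (e1 - e2) / (E1 - E2)
                if lo <= x <= hi:
                    candidates.add(x)
    best = None
    for x in sorted(candidates):
        # theta is held at 0 while no optimality cut exists (valid here because Q >= 0).
        theta = max((e - E * x for E, e in cuts), default=0.0)
        if best is None or c * x + theta < best[0] - 1e-12:
            best = (c * x + theta, x, theta)
    return best[1], best[2]


def l_shaped_1d(c, q, outcomes, lo, hi, tol=1e-9, max_iter=50):
    """Add optimality cuts of type (12.3c) until theta_bar >= Q(x_bar)."""
    cuts = []
    for _ in range(max_iter):
        x_bar, theta_bar = solve_master(c, cuts, lo, hi)
        q_val, slope = recourse(x_bar, outcomes, q)
        if theta_bar >= q_val - tol:
            return x_bar, c * x_bar + theta_bar, len(cuts)
        # Supporting hyperplane theta >= Q(x_bar) + slope * (x - x_bar),
        # rewritten in the form E*x + theta >= e.
        cuts.append((-slope, q_val - slope * x_bar))
    raise RuntimeError("no convergence within max_iter")


if __name__ == "__main__":
    demand = [(1.0, 0.25), (2.0, 0.5), (4.0, 0.25)]  # (realization, probability)
    print(l_shaped_1d(c=1.0, q=3.0, outcomes=demand, lo=0.0, hi=5.0))
```

The upper bound on `x` plays the role that bounded variables play in the code described later: it keeps the relaxed master bounded before enough cuts are present.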

For implementation in multi-stage problems, it is assumed that there are
a finite number $K_t$ of scenarios in each period $t$. The scenarios consist of all
possible realizations of the random vectors from periods 2 through $t$. For every
period $t$ scenario $j$, there corresponds a unique ancestor scenario $a(j)$ in period
$t-1$ and, perhaps, several descendant scenarios $d(j)$ in period $t+1$. NDSPA
solves (12.1) by first obtaining a feasible solution to (12.3) for all $t$ and $j$ and
by then sequentially solving (12.2) using the relaxation in (12.3) from periods
$T$ to one.
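
As an illustration of the ancestor and descendant bookkeeping, the sketch below builds the maps $a(j)$ and $d(j)$ for a tree in which every scenario of a period has the same number of children. That uniform branching, the numbering of scenarios from 1, and the function name `build_scenario_tree` are assumptions made for the example; they are not claimed to match the storage scheme of the actual code.

```python
def build_scenario_tree(realizations_per_period):
    """Build K_t, a(j) and d(j) for a tree with uniform branching.

    realizations_per_period[i] is the number of outcomes of the random
    vector in period i + 2 (period 1 is deterministic).
    """
    K = [1]                          # K[t - 1] = number of scenarios in period t
    ancestor = {}                    # (t, j) -> ancestor scenario index in period t - 1
    descendants = {(1, 1): []}       # (t, j) -> descendant scenario indices in period t + 1
    for t, n_real in enumerate(realizations_per_period, start=2):
        K.append(K[-1] * n_real)
        for j in range(1, K[-1] + 1):
            parent = (j - 1) // n_real + 1
            ancestor[(t, j)] = parent
            descendants[(t, j)] = []
            descendants[(t - 1, parent)].append(j)
    return K, ancestor, descendants


if __name__ == "__main__":
    K, a, d = build_scenario_tree([3, 2])
    print(K)            # scenario counts per period
    print(a[(3, 4)])    # ancestor of period-3 scenario 4
    print(d[(2, 2)])    # descendants of period-2 scenario 2
```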


NDSPA

Step 0.
    Solve (12.3) for $t = 1$ (dropping the scenario index $j$) where $\theta_1 = 0$,
    $r_1 = s_1 = 0$ and (12.3a) is replaced by $A_1 x_1 = b_1$. Set $\theta_t^j = 0$ and
    $r_t^j = s_t^j = 0$ in (12.3) for all $t$ and scenarios $j$ at $t$. (The indices $r_t^j$ and $s_t^j$
    are updated whenever a constraint (12.3b) or (12.3c) is added to (12.3).)

Step 1.
    If (12.3) is infeasible for $t = 1$, STOP. The problem (12.1) is infeasible.
    Otherwise, let $\bar{x}_1$ be the current optimal solution of (12.3) for $t = 1$. Use
    $\bar{x}_1$ as an input in (12.3a) for $t = 2$. Solve (12.3) for $t = 2$ and all $\xi_2^j$,
    $j = 1,\ldots,K_2$. If any period two problem (12.3) is infeasible, then add
    a feasibility cut (12.3b) to (12.3) for $t = 1$, re-solve (12.3) for $t = 1$, and
    return to 1. Otherwise let $t = 2$ and go to 2.

Step 2.
    (a) Let the current period $t$ optimal solutions be $\bar{x}_t^j$, $j = 1,\ldots,K_t$. Solve
        (12.3) for $t+1$ and all $j = 1,\ldots,K_{t+1}$ using the ancestor solution
        $\bar{x}_t^{a(j)}$ in (12.3a).

    (b) If any period $t+1$ problem is infeasible, add a feasibility cut (12.3b) to
        the corresponding ancestor period $t$ problem and re-solve that problem.
        If the period $t$ problem is infeasible, let $t = t-1$.
        If $t = 1$, go to 1.
        Otherwise, return to 2.a.
        Otherwise, all period $t+1$ problems (12.3) are feasible.
        If $t \le T-2$, let $t = t+1$ and return to 2.a.
        Otherwise ($t = T-1$), remove the $\theta_\tau^j = 0$ restriction for all periods
        $\tau$ and scenarios $j$ at $\tau$. Let the current value of each $\theta_\tau^j$ be $\bar{\theta}_\tau^j = -\infty$
        if no constraints (12.3c) are present. Go to 3.

Step 3.
    (a) Find $E_{t,l}^j$ and $e_{t,l}^j$ for a new constraint (12.3c) at each period $t$ scenario $j$
        problem (12.3) using the current period $t+1$ solutions.

    (b) If there exists $j$ such that

        \[ \bar{\theta}_t^j < e_{t,l}^j - E_{t,l}^j \bar{x}_t^j, \tag{12.4} \]

        then add the new constraint (12.3c) to each period $t$ problem (12.3) for
        which (12.4) holds. Solve each period $t$ problem (12.3). Use the resulting
        solutions $(\bar{x}_t^j, \bar{\theta}_t^j)$ to form (12.3a) for the corresponding descendant period
        $t+1$ problems (12.3) and re-solve each period $t+1$ problem (12.3).
        If $t < T-1$, let $t = t+1$ and go to 2.a.
        Otherwise, return to 3.a.

        Otherwise, $\bar{\theta}_t^j = e_{t,l}^j - E_{t,l}^j \bar{x}_t^j$ for all scenarios $j$ at $t$.
        If $t > 1$, let $t = t-1$ and return to 3.a. Otherwise, STOP. The current
        solutions $\bar{x}_\tau^j$, $\tau = 1,\ldots,T$, form an optimal solution of (12.1).


Steps 1 and 2 of NDSPA represent a forward pass to obtain feasibility in each
scenario subproblem. Step 3 is a backward pass that solves (12.2) beginning with
period $T$ and passing backward to period 1. Unboundedness may be handled
explicitly in the program following the procedure in Van Slyke and Wets [20],
but in the computer code of NDSPA all variables are upper bounded and hence
unboundedness is avoided. For period $T$, the computer code also has a special
procedure for solving (12.3). It uses the bunching method (see Wets [22] and
Chapter 3, Section 3.4) to look through all realizations of $\xi_T$ and find those for
which a given basis is optimal. This procedure is described in the next section
and represents an alternative to the sifting procedure of Garstka and Rutenberg
[8].

Experimental results using NDSPA have been encouraging. In Birge [3,
4], NDSPA is compared with a piecewise linear partitioning algorithm, a basis
factorization procedure and the code MINOS (Murtagh and Saunders [17]) on 
a set of staircase test problems from Ho and Loute [12]. NDSPA consistently 
outperformed the other methods except on one problem in which its storage 
limitations were exceeded. In general, the results compared favorably with 
those of Kallberg and Kusy [15] and Kallberg, White, and Ziemba [16] for 
simple recourse problems. Each stochastic problem was solved in less than 
twice the time required to solve the deterministic problem with expectations 
substituted for the random variables.

12.3 The NDST3 Computer Code: Primary Subroutines


NDSPA has been coded in FORTRAN in a current version called NDST3. 
This code allows for three periods including three second period scenarios and 
one hundred twenty-five third period scenarios. Each scenario problem (12.3) 
is limited to three hundred fifty rows and six hundred columns. Within any 
scenario problem (12.3), there can be at most three thousand nonzero elements. 
Tolerances can be set in the BLOCK DATA section and in the example are set 
at 10-7 for zero tolerance, 10~° for pivot tolerance, 10~* for reduced cost 
tolerance and 10~!° for small tolerance. The linear programming sections of 
the code are from LPM-1 written by J.A. Tomlin (Pfefferkorn and Tomlin
[18]).

Many variables in NDST3 have multiple subscripts. This questionable 
programming technique is used to make the scenario obvious. For example, 
XLB(2, 3, 2, 1) is the lower bound on the second variable in scenario 1 in period 
3 with ancestor scenario 2 in period 2. In general, the last three subscripts of 
all variables with more than two subscripts are (JCUR, JPER(2), JPER(3)) 
where JCUR indicates the period of the scenario, JPER(2) indicates the period 
2 ancestor scenario and JPER(3) denotes the period 3 scenario. This last period 
scenario is not used in the current version of NDST3 but has been used for a four 
period implementation. The current version is limited to three periods to avoid 
excessive storage requirements. The code can process four period problems if
the period 3 index is incremented in all array definitions and sufficient memory
is available. The subroutine SHIFTR, which manipulates data storage, must 
also be updated if the dimensions are changed. 
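
To make the storage concern concrete, the following sketch computes where an element such as XLB(2, 3, 2, 1) sits in a column-major (FORTRAN-ordered) array and how many entries one such array holds. The dimensions used, (600, 3, 3, 125), are only an assumption pieced together from the limits quoted above; the actual array declarations in NDST3 are not given here, and the helper names are hypothetical.

```python
def fortran_offset(subscripts, dims):
    """Zero-based offset of a 1-based subscript tuple in column-major order."""
    offset, stride = 0, 1
    for s, d in zip(subscripts, dims):
        if not 1 <= s <= d:
            raise IndexError("subscript out of range")
        offset += (s - 1) * stride
        stride *= d
    return offset


def storage(dims):
    """Number of elements held by an array with the given dimensions."""
    total = 1
    for d in dims:
        total *= d
    return total


if __name__ == "__main__":
    # Assumed dimensions: variable index, JCUR, JPER(2), JPER(3).
    dims = (600, 3, 3, 125)
    print(fortran_offset((2, 3, 2, 1), dims))  # position of XLB(2, 3, 2, 1)
    print(storage(dims))                       # entries in one such array
    print(storage((600, 4, 3, 125)))           # same array with the period index raised to 4
```

Each array carrying scenario subscripts grows with the product of its dimensions, which is consistent with the remark that the three-period limit is a matter of storage.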

The main variables in the code are stored in the blank common block. 
These variables and their descriptions follow. 


Variable            Definition

B(i,j,k,l)          Current right-hand side element i in period j and
                    scenario k, l
X(i,j,k,l)          Current value of variable basic in row i at period j and
                    scenario k, l
XLB(i,j,k,l) and    Lower and upper bounds of variable i at j, k, l
XUB(i,j,k,l)
XKSI(i,j,k,l)       Current realization of random vector in row i at j, k, l
YPI(i,j,k,l)        Current dual variable value for row i at j, k, l
NROW(j,k,l)         Current number of rows at j, k, l
NCOL(j,k,l)         Current number of columns at j, k, l
NELM(j,k,l)         Current number of nonzero elements at j, k, l
JH(i,j,k,l)         Variable basic in row i at j, k, l
KINBAS(i,j,k,l)     Status (basic, nonbasic) of variable i at j, k, l
LA, IA, A           Linked lists of $A_t$ matrix elements
LE, IE, E           Linked lists of elements in eta vector form of basis
                    inverse
LBN, IBN, ABN       Linked lists of elements in $B_t$ matrices
PROB(j,k,l)         Probability of scenario j, k, l

The important variables in BLOCK 3 are

Variable    Definition

NND(i)      Number of scenarios in period i
NPASS       Number of passes from period t to t - 1 or t + 1
JPER(i)     Current scenario realization in period i
JCUR        Current period
JPASS       Indicator of forward or backward pass; JPASS = 1 for forward,
            JPASS = 2 for backward
NPER        Number of periods T


In BLOCK 4, the significant variables are

Variable      Definition

XTOPT         Value of $-e_{t,l}^{j} + E_{t,l}^{j}\bar{x}_t^{j}$ for checking for optimality
PRBY(i,j)     Probability of j-th realization of i-th random element in
              stochastic vector in period T
PRST(i,j,k)   Joint probability of i-th realization of first random element,
              j-th realization of second, and k-th realization of third for
              stochastic vector at T
CBST(i,j)     Value of j-th realization of i-th random element in stochastic
              vector at T
NCUR(i)       Current realization of i-th random element at T
IBST(i)       Row of i-th random element at T
NST           Number of random elements at T


The code NDST3 assumes that specific random vectors (with specific prob- 
abilities) are assigned for periods 2 through T — 1 and that at period T the 
random vector includes NST independent random elements. The bunching ap- 
proach can then be easily applied to these possibilities. 
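
A minimal sketch of how independent random elements combine into period $T$ scenarios is given below. It enumerates every joint realization and multiplies the marginal probabilities, which is the role the table above assigns to PRST for three elements. The function name and the sample values are hypothetical, and nothing here reproduces the data layout of the FORTRAN arrays.

```python
from itertools import product


def joint_realizations(elements):
    """Enumerate joint outcomes of independent random elements.

    elements[i] is a list of (value, probability) pairs for the i-th element.
    Returns a list of (values_tuple, joint_probability).
    """
    scenarios = []
    for combo in product(*elements):
        values = tuple(v for v, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p          # independence: joint probability is the product
        scenarios.append((values, prob))
    return scenarios


if __name__ == "__main__":
    # Three independent elements with two values each (illustrative numbers).
    elements = [
        [(10.0, 0.5), (20.0, 0.5)],
        [(1.0, 0.5), (3.0, 0.5)],
        [(-2.0, 0.5), (2.0, 0.5)],
    ]
    scen = joint_realizations(elements)
    print(len(scen), sum(p for _, p in scen))
```

With three elements of two values each the enumeration yields eight scenarios, and the count grows multiplicatively with each added element.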

The main program in NDST3 organizes the algorithm and calls subroutines 
to implement the steps of NDSPA. The main routines called in this segment 
are: 


    INPUT     accepts all data input;
    INCHK     echoes input;
    NORMAL    solves the linear program in (12.3);
    STRPRT    reports on current solution;
    NDCOM     directs the algorithm for t < T;
    PARSFT    controls the algorithm for t = T;
    WRAPUP    writes output.


The main routine calls NORMAL to solve (12.3) if $t < T$ and then calls
NDCOM to determine which problem to solve next. If $t = T$, PARSFT is
called to solve (12.3) for period $T$ and determine the next step of the algorithm.
JCUR (the current period $t$) is set equal to NPER+1 (i.e., $T+1$) whenever a
terminating condition (infeasible or optimal) is met.

The following routines are all used by NORMAL in solving the linear program
(12.3):

    RHCHCK      checks row residuals;
    BTRAN       performs backward transformation;
    FORMC       forms objective function vector and checks feasibility;
    PRICE       computes reduced costs and picks entering column;
    CHUZR       performs minimum ratio test and determines leaving variable;
    WRETA       forms new eta-vectors for product form of inverse;
    SHIFTR      rearranges data storage;
    INVERT      computes basis inverse using LU decomposition;
    UNPACK(i)   expands i-th column in $A_t$;
    BUNPCK(i)   expands i-th column in $B_t$;
    SHFTE       shifts eta vectors around;
    UPBETA      updates right-hand side and basis indicators.


NORMAL reinverts the basis every INVFRQ iterations or if the maxi- 
mum row residual is greater than 10 times ZTOLZE. A maximum of ITRFRQ 
iterations is allowed. 

The subroutine NDCOM handles all steps of NDSPA for $t < T$. The
variable MSTAT is used to indicate infeasibility (QN) or feasibility (QF). If an
infeasibility is found, then a feasibility cut is added in the subroutine FEASCT
and $t$ is set to $t-1$. If the current problem is feasible, then NDCOM determines
the next subproblem to solve. If every scenario at period $t$ has been solved, then
NDCOM sets up problem (12.3) for period $t+1$. The subroutine BPRODX is
called here to compute $B_t \bar{x}_t^{j}$ and FRMRHS is called to find $\xi_{t+1}^{j} + B_t \bar{x}_t^{a(j)}$.

If the algorithm has proceeded to the backward pass, the control shifts 

Again, if an infeasibility is found, then a feasibility cut is added to the
corresponding ancestor scenario problem. First, any cuts (12.3b) or (12.3c)
that are slack (satisfied as strict inequalities) are deleted in the subroutine
DLETCT. This option saves on storage and does not affect convergence. NFLG
= 1 signifies that the current problem (12.3) solution is optimal. For $t < T-1$,
the code follows Step 2 of NDSPA and continues to $t+1$. If $t = T-1$ and
condition (12.4) is not met, then NDCOM follows the iterations in Step 3 of
NDSPA. If condition (12.4) is met in following this backward iteration, then an
optimality cut (12.3c) is placed on the corresponding ancestor scenario using the
subroutine LKHDCT(K), where K is the preceding period. Optimality at period
K is checked in the subroutine OPTCHK(K), which sets NFLG = 1 if (12.4) is not
met.
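
The deletion of slack cuts can be sketched as follows. This is an illustration of the idea only, not a transcription of DLETCT: cuts are stored here as hypothetical triples (coefficients, theta coefficient, right-hand side), with theta coefficient 0 for a feasibility cut (12.3b) and 1 for an optimality cut (12.3c), and the tolerance is an assumed value.

```python
def delete_slack_cuts(cuts, x, theta, tol=1e-7):
    """Keep only the cuts that are tight at the current solution (x, theta).

    Each cut is (coeffs, theta_coeff, rhs) and reads
        sum_i coeffs[i] * x[i] + theta_coeff * theta >= rhs.
    A cut satisfied as a strict inequality (beyond tol) is dropped.
    """
    kept = []
    for coeffs, theta_coeff, rhs in cuts:
        lhs = sum(a * xi for a, xi in zip(coeffs, x)) + theta_coeff * theta
        if lhs - rhs <= tol:
            kept.append((coeffs, theta_coeff, rhs))
    return kept


if __name__ == "__main__":
    cuts = [
        ([1.0, 0.0], 0, 1.0),   # feasibility cut, tight at x = (1, 2)
        ([1.0, 1.0], 1, 4.0),   # optimality cut, tight when theta = 1
        ([0.0, 2.0], 1, 1.0),   # optimality cut, slack at this point
    ]
    print(delete_slack_cuts(cuts, x=[1.0, 2.0], theta=1.0))
```

Dropping cuts this way trades storage for the possibility that a deleted cut has to be regenerated later.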

Subroutine PARSFT performs Step 3 of NDSPA for $t = T$. It includes
the variable JSTCH(i,j,k) that indicates the number of the basis found optimal
for the alternative with realizations $i$, $j$, $k$ for random elements 1, 2, and 3,
respectively, in the last period. NCUR(i) is the current realization of the $i$-th
random element and NXNF(i) is the realization of the $i$-th random element in
the first infeasible basis found by the bunching procedure. NETND(i) keeps the
number of eta vectors in the $i$-th basis, and INFLG = 0 for no infeasibilities and
1 for an infeasibility found in passing through all alternative random vectors at
$T$. YBX is a vector keeping $B_{T-1}\bar{x}_{T-1}$.

The bunching procedure begins by calling NORMAL to obtain an optimal 
solution. On subsequent iterations, the procedure begins with the previous basis 
which is dual feasible and calls the subroutine DNORML which implements the 
dual simplex method. In either case, if an infeasibility is caught then a feasibility 
cut is made and control returns to the main program.

After having found a new optimal basis, the algorithm updates $E_{T-1,l}^{j}$ and
$e_{T-1,l}^{j}$, and then loops through all right-hand side alternatives for which no
feasible basis has yet been found. Since every scenario corresponds to the same
objective function, an optimal basis for any scenario is dual feasible for all other 
scenarios. The appropriate right-hand side is set up and FTRAN is called to 
find the values of the basic variables. The subroutine DCHUZR is then called 
to determine a leaving (infeasible) variable. It returns IROWP = 0 for a fea- 
sible basis which is then optimal. If a leaving variable is found then DCHUZC 
is called to find an entering variable. If no entering variable is found then the 
current scenario is infeasible and control is returned to the main program. If 
an entering variable is found, then the current scenario is marked as the first 
scenario to check in the next bunching loop (if no scenario has been found 
infeasible for the current basis) and the next scenario is tested. 
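
The core test inside the bunching loop can be sketched in a few lines. The version below assumes the basis inverse is available as a small dense matrix and checks, for each right-hand side, whether the basic solution stays nonnegative; because the basis is already dual feasible for every scenario, those scenarios need no further pivots. The real code works with the eta file through FTRAN and continues with dual simplex pivots for the remaining scenarios, which this sketch does not attempt. The name `bunch` is hypothetical.

```python
def bunch(basis_inverse, rhs_list, tol=1e-9):
    """Split scenario indices by whether the given basis stays primal feasible.

    basis_inverse: dense inverse of the current (dual feasible) basis.
    rhs_list: one right-hand side vector per scenario.
    Returns (covered, remaining): indices for which the basis is optimal,
    and indices that still need dual simplex pivots.
    """
    covered, remaining = [], []
    for k, h in enumerate(rhs_list):
        x_basic = [sum(row[c] * h[c] for c in range(len(h))) for row in basis_inverse]
        if min(x_basic) >= -tol:
            covered.append(k)
        else:
            remaining.append(k)
    return covered, remaining


if __name__ == "__main__":
    # Inverse of the basis [[1, 0], [1, 1]].
    b_inv = [[1.0, 0.0], [-1.0, 1.0]]
    rhs = [[2.0, 3.0], [3.0, 2.0], [1.0, 1.0]]
    print(bunch(b_inv, rhs))
```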

Whenever a scenario is found to be feasible for the current right-hand side
then the values of $E_{T-1,l}^{j}$ and $e_{T-1,l}^{j}$ are updated, and the next scenario is chosen.
When an optimal basis has been found for all period T scenarios then optimality 
is checked using the subroutine XOPTCK. NFLG = 1 is returned if (12.4) is 
not met and the algorithm proceeds back to period T — 1. If (12.4) is met then 
a new optimality cut is added to the ancestor period T — 1 problem. 

The algorithm proceeds through these subroutines until optimality is found 
in NDCOM (for T > 2) or PARSFT(T = 2) or until infeasibility is found in 
NDCOM. When one of these terminal conditions is reached, WRAPUP is called 
and the output described in the next section is produced. 


12.4 Input and Output Formats

The input format for NDST3 basically follows the MPS standard for mathematical
programs, except that the data are split into periods. As a test problem
we used, among others, SCAGR7.S2, which was adapted from the staircase test
problems of Ho and Loute [12]. It contains two periods for the stochastic
program and, in the second period, there are three independent random variables
with two values each. This leads to eight total scenarios.

The first row of the input contains five values used in the program. Each is
entered in I4 format; they are, in order:

    IFPROB    number of problem;
    IOBJ      row of objective function (usually '1');
    INVFRQ    iterations between matrix inversions;
    ITRFRQ    total number of iterations allowed;
    NPER      number of periods.

The next NPER rows contain the number of different right-hand side values
(I4 format) for each period. The first and last periods have 1's because
the first period is deterministic and the last period right-hand sides are input
separately at the end of the program. The fourth row contains the probability
of the first right-hand side value in F5.3 format. The next sections are the ROWS,
COLUMNS, and RHS sections in MPS format for all values in the first period
set of constraints, $A_1 x_1 = b_1$. Following an ENDATA, lower bounds on all
variables (excluding slacks) in 9F8.0 format and upper bounds in the same format
are input. If an initial basis were entered, then a section headed by BASIS and
including columns and the corresponding row in the basis could be entered after
the COLUMNS section. This format is discussed below as part of the output.
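
For readers preparing input outside FORTRAN, the control row of five I4 fields described above could be read roughly as follows. This is a sketch under the assumption that the five fields occupy columns 1 to 20 with no separators and that a blank field counts as zero, as in FORTRAN formatted input; `parse_control_line` is a hypothetical helper and the sample values are made up.

```python
def parse_control_line(line):
    """Read the five I4 control fields from the first input row."""
    names = ("IFPROB", "IOBJ", "INVFRQ", "ITRFRQ", "NPER")
    values = {}
    for k, name in enumerate(names):
        field = line[4 * k: 4 * k + 4]          # columns 4k+1 .. 4k+4
        values[name] = int(field) if field.strip() else 0
    return values


if __name__ == "__main__":
    line = "%4d%4d%4d%4d%4d" % (1, 1, 50, 500, 2)   # illustrative values only
    print(repr(line))
    print(parse_control_line(line))
```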

The next sections of the code include ROWS and COLUMNS sections
to describe the matrix $B_1$ in (12.1). This is followed by an ENDATA and
the probability of the next period's first right-hand side vector. The data for
$A_2 x_2 = \xi_2^{j}$ would then be entered for each possible $\xi_2^{j}$ and, if more periods
were present, this would be followed in each case by the data for $B_2^{j}$ (possibly
depending on $j$). This process of repeating the probability of $\xi_t^{j}$, giving the
data for $A_t x_t = \xi_t^{j}$ and of then giving $B_t$ repeats until all scenarios indicated
in the command lines of the code have been input.

The last period scenario input is followed by a section marker STOCH 
which prompts the program to read in separate values and probabilities for 
random elements in the last period. For each random element, we must give 
the row name in columns 5-12, the value of the element is given in F12.4 format 
in columns 25-36 and the probability of that value is given in F12.4 format in 
columns 50-61. Each independent element is input with at most five values 
total. 
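
A line of the STOCH section, as laid out above, could be split into its three fields along these lines. The sketch assumes 1-based column positions exactly as stated (row name in 5-12, value in 25-36, probability in 50-61) and assumes that numbers are written with an explicit decimal point; `parse_stoch_line` and the row name used in the example are hypothetical.

```python
def parse_stoch_line(line):
    """Split one STOCH record into (row name, value, probability)."""
    line = line.ljust(61)                 # pad short lines so the slices exist
    name = line[4:12].strip()             # columns 5-12
    value = float(line[24:36])            # columns 25-36, F12.4
    prob = float(line[49:61])             # columns 50-61, F12.4
    return name, value, prob


if __name__ == "__main__":
    record = (" " * 4 + "ROWNAME1".ljust(8) + " " * 12
              + "%12.4f" % 1.5 + " " * 13 + "%12.4f" % 0.25)
    print(len(record))
    print(parse_stoch_line(record))
```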

Another version of NDST3, called NDST3.A, has also been developed at 
IIASA, Laxenburg, Austria. In this code, input follows the standard format set 
at IIASA except for the first line of input which contains the control parameters. 

NDST3 writes two output files on devices 6 and 7. The first one contains 
most of the iteration and result information. The second contains only the 
variables that were basic in the optimal solution found by the program. That 
output may be inserted into an input file to provide the program with a starting 
basis. For detailed instructions about input/output format and the structure 
of this code, see NDSP User’s Manual, Edwards [1985].

12.5 Extensions and Observations


As mentioned above, NDST3 can be easily expanded to handle larger problems 
and more scenarios. Some care, however, must be used in maintaining storage 
requirements within acceptable limits. Future versions of the code are planned 
to eliminate some redundancy and to enable more complex problems to be 
solved. Other planned options are to include the possibilities for some contin- 
uous distributions and to use approximating techniques from Birge and Wets 
[5] in achieving convergence within a predetermined tolerance. This has been 
implemented for a single random variable in a new code NDST4 and further 
refinements are planned. 

The code has performed very well in general and in most situations
significantly (by an order of magnitude) outperforms general purpose linear
programming codes. The one problem in which it did not perform well is one that
required that a large number of feasibility cuts be added to the first period
problem. These cuts were dense and, without deleting slack cuts, the problem
required an excessive number of nonzero elements (i.e., more than three
thousand). When slack cuts were deleted, the program obtained an unstable basis
that did not generate a feasible first period solution. This may be a problem
inherent in decomposition algorithms because of numerical error present in
generating cuts. Two truly identical cuts may be generated that differ only in their
error coefficients. This is the cutting plane analogue of the slow convergence
characteristics observed in Dantzig-Wolfe decomposition (Ho [11]). When NDST3
was implemented so that the row residuals were checked on every iteration, this 
problem was solvable despite the instability. It then required 1379 simplex iter- 
ations compared to 1742 iterations for a simplex method implementation on the 
deterministic equivalent problem. No other problems tested required NDST3 
to perform as many as half the number of iterations of the simplex method. 
The instability may therefore have caused slower convergence in this example. 
It appears that stability problems are rare but if further testing results in more 
of these difficulties, some testing of the integrity of cuts may have to be added.

CHAPTER 13


THE RELATIONSHIP BETWEEN THE L-SHAPED METHOD 
AND DUAL BASIS FACTORIZATION FOR STOCHASTIC 
LINEAR PROGRAMMING 


J.R. Birge


Abstract 


The basis factorization method of Strazicky for stochastic linear programs is 
shown to involve the same computational effort per iteration as the L-shaped 
method of Van Slyke and Wets. A variant of the factorization approach can 
then be found which is equivalent to the L-shaped method. The advantages of 
this decomposition approach over a standard factorization are discussed. 


13.1 Introduction 
We consider the problem

\begin{align*}
\text{minimize}\quad & cx + Q(x) \\
\text{subject to}\quad & Ax = b, \tag{13.1} \\
& x \ge 0,
\end{align*}

where

\[ Q(x) = E_{\xi}\bigl[\min\, qy \ \text{subject to}\ Wy = \xi - Tx,\ y \ge 0\bigr] \]

and $\xi$ is a random $m_2$-vector, where $A$ is an $m_1 \times n_1$ real matrix, $W$ is an $m_2 \times n_2$
real matrix, $T$ is an $m_2 \times n_1$ real matrix, and $c$, $q$, and $b$ are correspondingly
dimensioned vectors. For $\xi \in \Xi = \{\xi^1, \xi^2, \ldots, \xi^N\}$, where $P(\xi = \xi^i) = p^i$,
(13.1) is equivalent to


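\begin{align*}
\text{minimize}\quad & cx + p^1 q y^1 + p^2 q y^2 + \cdots + p^N q y^N \\
\text{subject to}\quad & Ax = b, \\
& Tx + W y^1 = \xi^1, \\
& Tx + W y^2 = \xi^2, \tag{13.2} \\
& \qquad\vdots \\
& Tx + W y^N = \xi^N, \\
& x, y^1, y^2, \ldots, y^N \ge 0.
\end{align*}

The dual of (13.2) is

\begin{align*}
\text{maximize}\quad & b^T \sigma + p^1 (\xi^1)^T \pi^1 + p^2 (\xi^2)^T \pi^2 + \cdots + p^N (\xi^N)^T \pi^N \\
\text{subject to}\quad & A^T \sigma + p^1 T^T \pi^1 + p^2 T^T \pi^2 + \cdots + p^N T^T \pi^N \le c^T, \\
& W^T \pi^1 \le q^T, \\
& W^T \pi^2 \le q^T, \tag{13.3} \\
& \qquad\vdots \\
& W^T \pi^N \le q^T.
\end{align*}

Kall [2] and Strazicky [3] observed that any feasible basis of (13.3) may be
written as

\[ \begin{bmatrix} B & Y \\ L & Z \end{bmatrix} \tag{13.4} \]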

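where $B$ is a block diagonal matrix. For (13.3),

\[
B = \begin{bmatrix}
[W_1^T \ I_1] & & & \\
& [W_2^T \ I_2] & & \\
& & \ddots & \\
& & & [W_N^T \ I_N]
\end{bmatrix},
\]

where $[W_i^T \ I_i]$ is an $n_2 \times n_2$ submatrix of $[W^T \ I]$ for all $i = 1, 2, \ldots, N$. Kall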
notes that we may reduce the size of [Wri] by taking an m2 x my nonsingular 
submatrix Wr from W;?. We have 


wea] Gel-Ed+[] @ ws 


or (Wi) 9+ WF) A = mi, =O — WI)WF)'G- (WT)(WT) “9. So, 
we can rewrite (13.5) as 


(2) (W2)K] B = (a- (W2). (WP)-14, (13.6) 


(13.6) substantially reduces the number of rows from (13.5) but it has a signif- 
icant drawback in terms of nonzero element storage. The sparse matrix wT 
may be transformed into a very dense matrix (W; 7) (W7) ~1, Kall uses this ma- 
trix in solving (13.3) and, therefore, must update the full (nz — m3) x (n2 —ma) 
basis throughout the algorithm. Wets [5] has observed that m3 x m -matrices 
wr should be used as working bases so that updates only need to be performed 
within these sparse matrices instead of in the larger, dense (m— 12) x (ng—ma) 
matrix in Kall's approach.
Wets also suggests that the algorithm may be made even more efficient by
taking advantage of the repetition of similar bases among the $W_i^T$. LU
decomposition and sparse updating may additionally be used to improve this approach.
Wets then conjectures what is shown below: that this method involves the same
computational effort per iteration as the L-shaped method of Van Slyke and
Wets [4] and that a modification of the dual basis factorization method will
follow the same path as the L-shaped method. This modification amounts to
the dual decomposition procedure proposed by Dantzig and Madansky [1].


13.2 Discussion


We assume we have a feasible solution to (13.3), (0°, 71°°,...,7 


p>), where (X°, p!*°,..., p%»°) are slack variables. We also assume that wi = 
WT for all 7 so ak ‘Gals columns from A? and the identity are basic in the 
first set. of constraints. In the pricing operation, we solve 


N,0 A°,p 1,0. 


A® zp +A zy =, 
Inzy =0, 


where Iy is the set of basic identity columns. We have zp = (A?)~1'b and 
check for zp > 0. If some zp (i) < 0, then that column in A® is replaced and 
the problem is solved again. If we restrict ourselves to only checking for primal 
feasibility in the z variables, then we are solving the dual problem 


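\begin{align*}
\text{maximize}\quad & b^T \sigma \\
\text{subject to}\quad & A^T \sigma \le c^T - \sum_{i=1}^{N} p^i T^T \pi^{i,0},
\end{align*}

or the primal problem

\begin{align*}
\text{minimize}\quad & \Bigl(c - \sum_{i=1}^{N} p^i (\pi^{i,0})^T T\Bigr) x \\
\text{subject to}\quad & Ax = b, \tag{13.7} \\
& x \ge 0.
\end{align*}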


This is essentially the first step of the L-shaped method. The dual method 
involves the same steps of computing A? 2p = b, 7A? =cp and p=cy —7AN 
as in the primal method, so the computational effort is the same at each step. 
We note that this does not include pricing for y variables as would occur in the 
general dual method. 

After all zp (i) > 0 have been found, we let 2" be the prices and we proceed 
to solve W; y' = €'-Te! for all y’. For every subproblem 3, if y' (7) <0 then we 
choose a leaving column only from the identity columns in subproblem 7. We 
relax dual feasibility in the first set of constraints. This process is equivalent 
to solving the subproblems 


minimize gy! 
subject to Wy' = é' —Tz', (13.8) 
y 20, 


for all ¢ as in the L-shaped method. We note again that the computations 
involved in a single iteration are the same in both methods except that we do 
not update the prices in the first set of constraints for the factorization method. 
We note also that solutions of (13.8) may be found quickly by finding all €' for 
which a given basis is optimal. 

After solving these problems, we obtain either an unbounded condition or 
all y' > 0. In the former case, some subproblem (13.8) is infeasible. We then 
look at the column in (13.3) which gave the unboundedness condition. For 
yi <0, y= (W;)-1(9,-) -(€! — Tz") and the column —[(W;)(j,-)- Wi]? < 0. 
We let 7 = ~(W)(G, -) and obtain 


n(é! —Tz') >0, (13.9) 


and 
mW, <0. (13.10) 


(13.9) and (13.10) are the infeasibility conditions for (13.8) that we would 
find in the L-shaped method. In the dual method, we would choose a pivot 
from the first set of constraints so that we would force 7(€' —- Tz) < 0. We 
introduce a new column in the main problem, 


[gre 


(p'TT a) p 
where p > 0. The main problem is then 


maximize 7 +p! (é'a')7 p 
subject to Ato + (p'T7x')p <7, 
p20, 


where é? includes c? and other fixed columns of 7. This is equivalent to adding 
a constraint 
(1'T)z <7'é, 


as in the L-shaped algorithm. We next solve the main problem again and 
repeat. 

If after solving the subproblems all y/ > 0, then either the problem is 
optimal or one of the first set of constraints in (13.3) has been violated. In this 
case, if we let 8 = 0%, p'((€)” — (T'.21)7)a', where x is the initial solution 
of (13.3), then we have 


N 
6 < S~pi((E)? — (Lei) a"? 
t=1 


where z'-? is the optimal subproblem solution. We observe that either 7’! or 


a'>? or linear combinations of these solutions may be used as solutions for the 
subproblems. We use this to obtain a substitute first period problem: 


N N 
maximize B70 + A1() pi (El) Tai) + dal pi (E) Pa") 


f=) i=l 
N . : “ N . m . 
subject to A?o + Mo 2 (€) 7 a!) +9 (Sop (€') Pah?) <e™ (18-11) 
i=1 t=1 
Ay + Ag = 1 
AryAg > 0. 


We solve problem (13.11) and repeat by adding a column for feasibility of the 
subproblems or by adding a column for choices of subproblem solutions as in 
(13.11). We note that these are the same steps as in the L-shaped decomposition 
method where 9 < es p'((é')7 — (F21)7)z! and a constraint on 6, 


N N 
(>: rt) ato>) pie)’, (13.12) 


f1 


is added. The two methods with these specifications follow the same procedures 
for each iteration. We note also that these methods follow the same steps as 
Dantzig-Wolfe decomposition applied to the dual problem (13.3) (Dantzig and
Madansky [1], Van Slyke and Wets [4]).

13.3 Conclusion


We have shown that on each iteration of the L-shaped method, the number of 
steps is equivalent to that of the basis factorization method and that the L- 
shaped method may be viewed as a variant of the basis factorization approach. 
In general, however, the two methods will not follow the same path to optimal- 
ity. By maintaining dual feasibility, the basis factorization restricts the path to 
optimality and requires more effort in checking for feasibility within the first 
set of constraints. 

The decomposition variant of basis factorization also avoids two other
problems inherent in the full factorization approach. For $X = B^{-1}Y$ in (13.4), the
factorization approach uses the inverse of $(LX - Z)$ in performing simplex
operations. $X$ is composed of columns of $B^{-1}$ since $Y$ is composed of identity
columns. The columns of $B^{-1}$ need not be sparse and may be very dense,
causing $(LX - Z)$ to be dense as well. The storage requirement for the nonzero
elements of this $n_1 \times n_1$ matrix may be large.

Another difficulty in applying this factorization without decomposition is
that, whenever an identity column in $I_i$ in (13.5) is replaced, then $W_i^T$ must be
changed and $(LX - Z)$ changes. This pivot alters the prices $(\pi^j)$ in all other
blocks $j \ne i$. Therefore, a pivot step is required for each new block into which
this identity column enters. By fixing $x$ in the decomposition, whenever a new
matrix $W_i$ is introduced, all values $\xi^j$ such that $y^j = (W_i)^{-1}(\xi^j - Tx) \ge 0$
can be found without performing separate pivot operations. For very large $N$,
the standard factorization scheme may be forced through a long sequence of
pivots, whereas the decomposition approach may change these bases quickly.
For problems with large $N$, then, the decomposition variant above is probably
the only tractable basis factorization method.

CHAPTER 14


DESIGN AND IMPLEMENTATION OF A STOCHASTIC 
PROGRAMMING OPTIMIZER WITH RECOURSE AND 
TENDERS 


L. Nazareth 


Abstract 


This paper serves two purposes, to which we give equal emphasis. First, it 
describes an optimization system for solving large-scale stochastic linear pro- 
grams with simple (i.e. decision-free in the second stage) recourse and stochastic 
right-hand-side elements. Second, it is a study of the means whereby large-scale 
Mathematical Programming Systems may be readily extended to handle certain 
forms of uncertainty, through post-optimal options akin to sensitivity or para- 
metric analysis, which we term “recourse analysis”. This latter theme (implicit 
throughout the paper) is explored in a proselytizing manner, in the concluding 
section. 


14.1 Introduction 


This paper is a sequel to Nazareth and Wets [21] and serves two purposes, to 
which we give equal emphasis. First, it describes an optimization system for 
solving a restricted but important class of large-scale stochastic linear programs 
with recourse. Second, it is a study and detailed illustration of the means 
whereby any large-scale Mathematical Programming System (MPS) designed 
for solving deterministic linear programs, could be readily extended to handle 
some forms of uncertainty, in particular, via post-optimal analysis options. This 
latter theme (implicit throughout the paper) is explored, in a proselytizing 
manner, in the concluding section. 

The class of practical stochastic linear programs with which we are con- 
cerned (termed C1 problems in [21]) arise as a natural extension of the linear 
programming model as follows: given a linear program with matrix A, it is often 
the case that some of the components of the right-hand-side (exogenous) vector 
of resource availability or resource demand, are known only in probability and 
have been replaced (in the deterministic LP formulation) by some expected 
value. We seek to extend this linear program, using the recourse formulation. 
Rows of A corresponding to the stochastic right-hand-side are used to define 
the technology matrix T (we follow the notation and terminology of [21]) and 
the remaining rows of A are used to define the constraint matrix A, both A 
and T being typically large, sparse matrices. The recourse is assumed to be 
simple (i.e. decision-free in the second-stage problem) and specified in terms of 
costs (or penalties) on shortage and surplus. Furthermore, we restrict attention 
to the case where each component of the stochastic right-hand-side has a given 
discrete probability distribution. There are many applications for such a model, 
see Ziemba [27], and more complex stochastic linear programs with recourse 
can sometimes be solved by an iterative discretization or sampling procedure 
involving definition and solution of a sequence of C1 problems. 

The above considerations are very much in the background of our imple- 
mentation design, our choice of algorithms and of the more general issues which 
we wish to discuss regarding the extension of conventional Mathematical Pro- 
gramming Systems, so as to be able to handle at least some forms of uncertainty. 
Our optimization system is based primarily upon a version of Wolfe’s general- 
ized programming algorithm (see Dantzig [4]) given in Nazareth and Wets [21] 
Section 3.2.1 and, in more detail, in Nazareth [18]. It also includes a version 
of an algorithm based upon bounded variables (see Wets [25]) given in [21] 
Section 2.1 and, again in more detail, in [20]. Two simpler options, namely 
to solve an initial linear program and to permit some of its constraints to be 
“elastic” are also included to help get a recourse problem “off the ground”. In 
our implementation (see Nazareth [19] for an overview of our overall approach) 
we have utilized current mathematical programming technology for specifying 
the problem (using standard MPS conventions [14] for the LP portion and a 
suitable extension to provide the added stochastic information), to represent 
the data internally (in packed data structures, space for which is dynamically 
allocated within a work storage array) and to implement our solution strate- 
gies (using an efficient and numerically stable implementation of the simplex 
method, namely the MINOS System of Murtagh and Saunders [15], [16]). 

Finally, we want our design to mesh as naturally as possible with current 
Mathematical Programming Systems. In particular, we argue in the concluding 
section of our paper, that “recourse analysis” (simple recourse to start off with, 
but also more general forms of recourse) could be provided as a post-optimal 
analysis option in any large-scale MPS, to augment the options for parametric 
and sensitivity analysis that are now usually available. 

14.2 Overview of the SPORT System 


14.2.1 Problem 

SPORT (pronounced SupPORT) is an acronym for Stochastic Programming 
Optimizer with Recourse and Tenders. The current version solves large-scale 
stochastic programs with simple (decision-free in the second stage) recourse and 
discrete distribution of right-hand-side elements (termed C1 problems). The 
formal statement of such problems may be found in [21] (see (1.1) through (1.3) 
where W = [I, −I] and where the right-hand-side h(ω) is the only stochastic 
quantity, with a known discrete distribution) and we shall not repeat it here. 
Instead, we shall state the problem from the perspective emphasized in this 
paper, namely that of a given linear program in which inherent uncertainty in 
some of the right-hand-side (exogenous) elements is to be more fully taken into 
account. Consider therefore the linear program 


$$
\begin{array}{ll}
\text{minimize}   & cx \\
\text{subject to} & Ax = d \\
                  & x \ge 0
\end{array}
\tag{14.1}
$$


where A is an $m \times n$ matrix (which is generally large and sparse), d is a given 
m-vector and c is a given n-vector. Some of the elements of d which correspond to 
demands (or available resources) may be, in reality, only known in probability 
and defined in (14.1) by taking some expected value. For simplicity, let us 
suppose that the corresponding “technology” constraints of (14.1) are the last 
$m_2$ constraints and let us denote them by $Tx = h$, where T is an $m_2 \times n$ matrix. 
Let the remaining $m_1$ constraints be $Ax = b$ where A is an $m_1 \times n$ matrix and 
$d = \begin{pmatrix} b \\ h \end{pmatrix}$. 

A useful extension to the LP model (14.1) is to permit the constraints 
$Tx = h$ to be “elastic” (Tomlin [24]) by imposing a penalty $q_i^+$ on shortage 
in the i-th technology constraint when demand (corresponding to the 
right-hand-side element $h_i$) exceeds the supply $(Tx)_i$, so that 
$y_i^+ = h_i - (Tx)_i \ge 0$. Similarly let $q_i^-$ be the penalty imposed on 
surplus (when the reverse of the above condition holds) so that 
$y_i^- = (Tx)_i - h_i \ge 0$. (The choice of notation $q_i^+$ for shortage and 
$q_i^-$ for surplus is a little unfortunate, but is now standard.) 
Thus associated with the decision x for the “first-stage” or decision variables, 
we have, for the i-th technology constraint, a penalty of 

$$
\begin{cases}
q_i^+ \,(h_i - (Tx)_i) & \text{when } (h_i - (Tx)_i) \ge 0 \\
q_i^- \,((Tx)_i - h_i) & \text{when } (h_i - (Tx)_i) < 0
\end{cases}
$$

To minimize first stage costs and all penalty costs we can formulate the 
extension of (14.1) as a problem with “elastic” constraints as follows: 

$$
\begin{array}{ll}
\text{minimize}   & cx + q^+ y^+ + q^- y^- \\
\text{subject to} & Ax = b \\
                  & Tx + y^+ - y^- = h \\
                  & x \ge 0,\; y^+ \ge 0,\; y^- \ge 0
\end{array}
\tag{14.2}
$$

where $q^+$ and $q^-$ are $m_2$-vectors with components $q_i^+$ and $q_i^-$ respectively. 

Unfortunately (14.2) does not address the uncertainty in the right-hand 
side vector, which so far has been replaced by h. One way to address uncertainty 
is to compute the penalty cost associated with each realization of the random 
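vector h(ω). Let us also define the “tender” or “bill of goods” associated with 
a decision x by $\chi = Tx$. Thus we have 

$$
\psi_i(\chi_i, h_i) \triangleq \psi_i(\chi_i, h_i(\omega)) =
\begin{cases}
q_i^+ \,(h_i(\omega) - \chi_i) & \text{when } (h_i(\omega) - \chi_i) \ge 0 \\
q_i^- \,(\chi_i - h_i(\omega)) & \text{when } (h_i(\omega) - \chi_i) < 0
\end{cases}
$$

Let 

$$
\Psi(\chi) \triangleq E_\omega\Big(\sum_{i=1}^{m_2} \psi_i(\chi_i, h_i(\omega))\Big)
= \sum_{i=1}^{m_2} E_\omega\big(\psi_i(\chi_i, h_i(\omega))\big)
\triangleq \sum_{i=1}^{m_2} \Psi_i(\chi_i).
$$

We seek to minimize the cost of the decision cx and the expected value of the 
penalty costs. Thus we can formulate this extension of (14.1) as 

$$
\begin{array}{ll}
\text{minimize}   & cx + \sum_{i=1}^{m_2} \Psi_i(\chi_i) \\
\text{subject to} & Ax = b \\
                  & Tx - \chi = 0 \\
                  & x \ge 0
\end{array}
\tag{14.3}
$$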

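For C1 problems it can be readily demonstrated (see, for example [25], [20]) 
that 

$$
\Psi_i(\chi_i) = \max_{\ell = 0, \ldots, k_i} \left( s_{i\ell}\,\chi_i + e_{i\ell} \right)
$$

where $s_{i\ell}$ and $e_{i\ell}$ are defined from the probability distribution of 
$h_i(\cdot)$. Let this be given by values $h_{i1}, h_{i2}, \ldots, h_{ik_i}$, with 
$h_{i\ell} < h_{i,\ell+1}$, with associated probabilities 
$p_{i1}, p_{i2}, \ldots, p_{ik_i}$. Then, for $\ell = 0, \ldots, k_i$, 

$$
\begin{aligned}
s_{i\ell} &= \Big(\sum_{t=1}^{\ell} p_{it}\Big)\, q_i - q_i^+ \\
e_{i\ell} &= q_i^+ \bar{h}_i - q_i \Big(\sum_{t=1}^{\ell} h_{it}\, p_{it}\Big)
\end{aligned}
\tag{14.4}
$$

where, by convention, $\sum_{t=1}^{0} = 0$, $q_i = (q_i^+ + q_i^-) \ge 0$ and 
$\bar{h}_i$ is the expected value of $h_i(\omega)$. Finally, using a theorem in 
[18], it is possible to state (14.3) in an equivalent form and in so doing also 
unify with (14.2) as follows: 

$$
\begin{array}{ll}
\text{minimize}   & cx + q^+ y^+ + q^- y^- + \sum_{i=1}^{m_2} \Psi_i(\chi_i) \\
\text{subject to} & Ax = b \\
                  & Tx + y^+ - y^- - \chi = 0 \\
                  & x \ge 0,\; y^+ \ge 0,\; y^- \ge 0
\end{array}
\tag{14.5}
$$

For $\chi \equiv h$ we obtain (14.2) since $\Psi(h)$ is then a constant term. (14.5) is a 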
piecewise-linear separable convex programming problem with which we shall 
be concerned henceforth. It makes possible both convenient implementation of 
the algorithms which we employ and the various options that we provide, as 
discussed in the next section. 
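
As an illustration of (14.4), the following Python sketch computes the slopes 
$s_{i\ell}$ and intercepts $e_{i\ell}$ for a single technology row and checks that 
the maximum of the resulting linear pieces agrees with the expected penalty 
computed directly from the distribution. The function names are hypothetical 
and are not part of SPORT; the data are those of Product 1 in the product-mix 
example used later in this chapter. 

```python
def recourse_pieces(levels, probs, q_plus, q_minus):
    """Return slopes s_l and intercepts e_l (l = 0..k) as in (14.4)."""
    q = q_plus + q_minus
    h_bar = sum(h * p for h, p in zip(levels, probs))  # expected value of h_i
    slopes, intercepts = [], []
    cum_p = 0.0   # sum of p_t for t <= l (empty sum is 0 when l = 0)
    cum_hp = 0.0  # sum of h_t * p_t for t <= l
    for l in range(len(levels) + 1):
        if l > 0:
            cum_p += probs[l - 1]
            cum_hp += levels[l - 1] * probs[l - 1]
        slopes.append(cum_p * q - q_plus)
        intercepts.append(q_plus * h_bar - q * cum_hp)
    return slopes, intercepts


def psi_from_pieces(chi, slopes, intercepts):
    """Evaluate Psi_i(chi) as the maximum of the linear pieces."""
    return max(s * chi + e for s, e in zip(slopes, intercepts))


def psi_direct(chi, levels, probs, q_plus, q_minus):
    """Expected shortage/surplus penalty, straight from the definition."""
    return sum(
        p * (q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0))
        for h, p in zip(levels, probs)
    )


# Product 1 of the product-mix example: demand levels and probabilities.
levels, probs = [8.0, 10.0, 12.0], [0.25, 0.5, 0.25]
slopes, intercepts = recourse_pieces(levels, probs, q_plus=2.0, q_minus=1.0)

# The two evaluations should agree at, between and outside the breakpoints.
for chi in (6.0, 8.0, 9.0, 10.0, 11.5, 12.0, 14.0):
    a = psi_from_pieces(chi, slopes, intercepts)
    b = psi_direct(chi, levels, probs, 2.0, 1.0)
    assert abs(a - b) < 1e-9
```

The slopes increase with $\ell$ from $-q_i^+$ to $q_i^-$, which is what makes 
$\Psi_i$ convex and the maximum over the pieces a valid representation. 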


14.2.2 Algorithms 


The system is based primarily upon the Wolfe generalized programming approach 
as discussed in [21], Section 3.2.1. The particular algorithm implemented 
here, termed ILSRDD (Inner Linearization-Simple Recourse-Discrete 
Distribution), is described in detail in [18]. The generalized programming 
approach was chosen because it proved effective in earlier experimental 
versions (see [18]) and because of its potential applicability to a wide class 
of stochastic programs (including problems with complete recourse and problems 
with probabilistic constraints, see [18]). We also include an alternative to 
ILSRDD. This is an algorithm based upon problem redefinition and the introduction 
of bounded variables given by Wets [25] and implemented in the simpler 
form given in Nazareth and Wets [20]. The algorithm is termed BVSRDD 
(Bounded Variables-Simple Recourse-Discrete Distribution). This approach is 
much more limited in its range of possible application as we have discussed in 
[21], but we include it for the following reasons: (a) it is very convenient to 
have a second algorithm that works on basically the same input as ILSRDD, 
for purposes of comparisons of answers and validation of implementation. Two 
identical answers on a particular problem from two different algorithms are 
rather comforting in this world of uncertainty and although this is no guaran- 
tee of correctness, it provides some indication that an error (if any) is in the 
input data or its conversion into internal representations. (b) A fair amount of 
experience has been accumulated with an early implementation of this method 
for dense problems (see Kallberg & Kusy [11]) and a more advanced implementation 
(which handles sparsity) should be available. (When there are relatively 
few points in each distribution of $h_i(\cdot)$ then this may even be a quite efficient 
way to solve C1 problems.) (c) The algorithm BVSRDD makes possible a simpler 
and more direct extension of a deterministic MPS when the aim is only to 
handle simple recourse. 

Two further options are provided in order to be able to solve (14.5) with 
$\chi = h$ (ELASTIC option) and in order to solve an initial linear program, 
equivalent to (14.5) with $\chi = h$, $q_i^+ = q_i^- = \infty$ (MINOS option). Here, h 
denotes an arbitrary right-hand-side vector. Both of these options are useful as 
preliminaries to the recourse formulation. 

14.2.3 Implementation 


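From a practical standpoint, the linear programs which we want to solve and 
extend are of the more general form: 

$$
\begin{array}{ll}
\text{minimize}   & cx \\
\text{subject to} & Ax \;\{\le,\, =,\, \ge\}\; b \\
                  & l \le x \le u
\end{array}
\tag{14.6}
$$

where $\{\le, =, \ge\}$ indicates that constraints take one of three possible forms 
and u and l are vectors of upper and lower bounds. Furthermore, we cannot usually 
expect the partition $A = \begin{pmatrix} A \\ T \end{pmatrix}$ with technology rows T 
coming last in the matrix A. In general, rows of A and rows of T will be 
interleaved in A. In addition, it is worthwhile to explicitly include a scale 
factor $\rho$ to permit a weighting of the second-stage objective relative to the 
first (see [18]). Thus the practical problems which we seek to solve are derived 
from (14.5) and (14.6) and take the form 

$$
\begin{array}{ll}
\text{minimize}   & cx + q^+ y^+ + q^- y^- + \rho \sum_{i=1}^{m_2} \Psi_i(\chi_i) \\
\text{subject to} & A^{\alpha_i} x \;\{\le,\, =,\, \ge\}\; b_i, \quad \alpha_i \in \mathcal{A} \\
                  & A^{\tau_i} x + y_i^+ - y_i^- - \chi_i = 0, \quad \tau_i \in \mathcal{T} \\
                  & l \le x \le u,\; y^+, y^- \ge 0
\end{array}
\tag{14.7}
$$

where $A^{\alpha_i}$, $\alpha_i \in \mathcal{A}$ defines the rows of A, $A^{\tau_i}$, 
$\tau_i \in \mathcal{T}$ defines the rows of T, and $\mathcal{A}$ and $\mathcal{T}$ are 
index sets with $|\mathcal{A}| = m_1$, $|\mathcal{T}| = m_2$ ($|\mathcal{A}|$ denotes the 
number of indices in $\mathcal{A}$). 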

Our system for solving recourse problems of the form (14.7) has three main 
phases: 

Phase 1: Problem Setup and Generation 

Phase 2: Specialized Setup and Solution 

Phase 3: Output 


This is summarized in Figure 14.1. A design goal was that all algorithms 
work on essentially the same input and each ignore input data that is only 
required by the others, e.g. the limit on the number of cycles, which is only 
required by ILSRDD. The input is specified in the form of three files of infor- 
mation which are described in more detail in the next section. All that is often 
necessary to switch options is to change the algorithm card in the “control” 
file and check that enough work space has been provided for various items. 
The Problem Setup and Generation Phase results in the creation of two files 
required by MINOS—the SPECS file and the MPS file. The next main phase 
consists of reading in these files by MINOS, inserting additional columns into 
its packed data structures and finding the solution of the problem. Finally the 
Output Phase augments the solution output by MINOS with some additional 
information about the solution of the stochastic program with recourse. 

The next three sections go into this in more detail. 


14.3 Problem Setup and Generation 


To be specific, we discuss this within the context of a very simple example. 
Consider the following product-mix example (due to J. Ho [10]). The problem 
has two products and three ingredients. We seek to minimize cost of produc- 
tion while maintaining the levels of fat and protein at acceptable levels, and 
not exceeding availability of ingredients. The demand for each product is a 
random variable with discrete distribution but in an LP formulation this must 
be replaced by some expected value. The problem is summarized as follows, 
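where $x_i$, $y_i$, $z_i$ denote the amount of each ingredient in product i (i = 1, 2). 

$$
\begin{array}{lll}
\text{minimize} & x_1 + 2y_1 + 3z_1 + x_2 + 2y_2 + 3z_2 & \text{(OBJ)} \\
\text{subject to} & & \\
\text{Fat/protein content of product 1:} & 0.3x_1 + 0.4y_1 + 0.2z_1 \ge 3.3 & \text{(A3)} \\
\text{Fat/protein content of product 2:} & 0.5y_2 + 0.6z_2 \le 4.0 & \text{(A4)} \\
\text{Amount of ingredient 1:} & x_1 + x_2 \le 15.0 & \text{(A1)} \\
\text{Amount of ingredient 2:} & y_1 + y_2 \le 12.0 & \text{(A2)} \\
\text{Amount of product 1:} & x_1 + y_1 + z_1 = h_1 & \text{(T1)} \\
\text{Amount of product 2:} & x_2 + y_2 + z_2 = h_2 & \text{(T2)} \\
 & x_i,\, y_i,\, z_i \ge 0, \quad i = 1, 2 &
\end{array}
$$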


The penalties for under and over production are 2.0 and 1.0 units, respectively, 
for each product, and the probability distribution on demand h(·) is as 
follows: 

                    Product 1                Product 2 
    Level         8.0    10.0   12.0      15.0   18.0   20.0 
    Probability   0.25   0.5    0.25      0.2    0.4    0.4 

Thus the expected demands are $\bar{h}_1 = 10.0$ and $\bar{h}_2 = 18.2$. The recourse 
function $\Psi(\chi)$ is defined in the usual way with $q^+ = (2.0, 2.0)$ and 
$q^- = (1.0, 1.0)$. 

Figure 14.1 Overview 

This simplified example will be quite adequate for purposes of illustration, 
and it can obviously be scaled up to a more realistic problem involving several 
products and ingredients. 
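
The numbers in this example can be checked with a few lines of Python. The 
sketch below (helper names are ours, purely illustrative) recomputes the 
expected demands from the table and evaluates the expected penalty that remains 
when the tender is simply set to the expected demand, as a deterministic LP 
would do. 

```python
Q_PLUS, Q_MINUS = 2.0, 1.0  # penalties on shortage and surplus

# Demand levels and probabilities for the two technology rows.
DEMAND = {
    "T1": ([8.0, 10.0, 12.0], [0.25, 0.5, 0.25]),
    "T2": ([15.0, 18.0, 20.0], [0.2, 0.4, 0.4]),
}


def expected_demand(levels, probs):
    """Mean of a discrete distribution."""
    return sum(h * p for h, p in zip(levels, probs))


def expected_penalty(chi, levels, probs):
    """Expected shortage/surplus penalty for a tender chi."""
    return sum(
        p * (Q_PLUS * max(h - chi, 0.0) + Q_MINUS * max(chi - h, 0.0))
        for h, p in zip(levels, probs)
    )


summary = {}
for row, (levels, probs) in DEMAND.items():
    h_bar = expected_demand(levels, probs)
    summary[row] = (h_bar, expected_penalty(h_bar, levels, probs))
    print(f"{row}: expected demand {h_bar:.2f}, "
          f"expected penalty at that tender {summary[row][1]:.2f}")
```

Under these data the expected penalty at the mean tender is about 1.5 for T1 
and about 2.16 for T2; the deterministic formulation carries neither term in 
its objective, which is the gap the recourse formulation is meant to close. 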


14.3.1 Corefile 


The input data corresponding to the decision variables x of the problem forms 
the “corefile”. This specifies 


- the names and types of each row of the problem 
- the objective c 

- the coefficients of A and T 

- the deterministic right-hand-side elements 


- the bounds on variables and ranges on rows 


The “corefile” is specified in standard MPS format, see [14], and will often 
originate in a prior LP formulation. A and T can have interleaved rows and 
rows corresponding to T should normally be equality rows. However, if these 
correspond to ≥ or ≤ rows, i.e. if there is no penalty on surplus or shortage, 
respectively, then provision is made in the system to change these to equality 
rows and a warning message is printed to that effect. This means that $q_i^+$ or 
$q_i^-$ must be chosen appropriately at value zero. Note also that if there were 
non-zero elements in the right-hand-side vector corresponding to rows in the 
technology matrix they will be ignored by ILSRDD or BVSRDD and a message 
printed to this effect. 

For our example, the corefile is given in Figure 14.2. (Slack variables were 
introduced explicitly in this case, but this is not necessary and could have been 
avoided by appropriate definition of row types.) 


14.3.2 Stochastics File 
The “stochastics” file specifies the information pertaining to the recourse prob- 
lem. It gives: 

- the row names identifying the technology matrix 

- the probability distribution for each stochastic right-hand side 

- the penalties $q^+$ and $q^-$ on shortage and surplus 

- the set of initial tenders for ILSRDD or the base tender for BVSRDD 

An MPS-like format was designed for each of these items of information 
and is explained in the rest of this subsection. (An extension of this format is 
given in Edwards et al. [7].) 

NAME This is a header card. The user may enter any characters in columns 
15 to 72. 

TECHNOLOGY The data consists of a list of names, one for each row in 
the technology matrix. These must be a subset of the list of rownames in the 
“corefile”. The submatrix corresponding to this set of rows in the COLUMNS 
section of the “corefile” defines the technology matrix. One name appears per 
line in columns 5 through 12. 


NAME          ip
ROWS
 N  OBJ
 E  A1
 E  A2
 E  A3
 E  A4
 E  T1
 E  T2
COLUMNS
    CLM1      OBJ       1.0            A1        1.0
    CLM1      A3        0.3            T1        1.0
    CLM2      OBJ       2.0            A2        1.0
    CLM2      A3        0.4            T1        1.0
    CLM3      OBJ       3.0            A3        0.2
    CLM3      T1        1.0
    CLM4      A3        -1.0
    CLM5      OBJ       1.0            A1        1.0
    CLM5      T2        1.0
    CLM6      OBJ       2.0            A2        1.0
    CLM6      A4        0.5            T2        1.0
    CLM7      OBJ       3.0            A4        0.4
    CLM7      T2        1.0
    CLM8      A4        -1.0
    CLM9      A1        1.0
    CLM10     A2        1.0
RHS
    RTH       A1        15.0           A2        12.0
    RTH       A3        3.3            A4        4.0
    RTH       T1        10.0           T2        18.2
BOUNDS
ENDATA


Figure 14.2 The corefile 

DISTRIBUTION The data consists of sets of entries of the form “rowname 
value probability”. There is one such set for each of the rows named in the 
TECHNOLOGY section. “rowname” specifies the row associated with the en- 
try (columns 5 through 12). “value” and “probability” specify the point and 
its associated probability. They occupy the first and second numeric fields 
(columns 25 through 36 and 50 through 61) respectively and must be spec- 
ified as real numbers. The “rowname” repeats itself for each possible value 
associated with the row and the probabilities for this “rowname” must sum to 
unity. 

OBJECTIVE The data consists of entries of the form “name value value” 
where name is a rowname of T and the first value gives the value of $q_i^+$ 
and the second the value of $q_i^-$, i.e. the penalties on shortage and surplus 
respectively. The name occupies the first field (columns 5 through 12) 
and the values the first and second numeric fields (columns 25 through 36 
and 50 through 61) respectively. They must be specified as real numbers. 


TENDERS The data consists of entries of the form “name rowname value” 
where name is the name associated with the tender, “rowname” specifies the 
row associated with the entry and “value” is the level of the tender for 
this row. “name” repeats itself over all entries associated with the tender 
and there is one such “name” for each tender specified. “name” and 
“rowname” occupy the first two name fields (columns 5 through 12) and 
(15 through 22) respectively and “value” the first numeric field (columns 
25 through 36). (If a set of these are provided for ILSRDD then the first 
one is used by BVSRDD as its base tender, see Sec. 2.1 of [21].) 


ENDATA This card must be specified and flags the end of the “stochastics” 
file. 
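
To make the layout of the sections concrete, here is a minimal Python sketch 
of a reader for a “stochastics” file. It is not the SPORT input routine: as a 
simplifying assumption it splits each data line on whitespace instead of using 
the fixed column positions given above, and the function name and the returned 
dictionary layout are hypothetical. The sample text reproduces the product-mix 
data of Section 14.3. 

```python
SECTIONS = {"NAME", "TECHNOLOGY", "DISTRIBUTION", "OBJECTIVE", "TENDERS", "ENDATA"}

EXAMPLE = """\
NAME          TEST
TECHNOLOGY
    T1
    T2
DISTRIBUTION
    T1      8.0   0.25
    T1     10.0   0.5
    T1     12.0   0.25
    T2     15.0   0.2
    T2     18.0   0.4
    T2     20.0   0.4
OBJECTIVE
    T1      2.0   1.0
    T2      2.0   1.0
TENDERS
    TEND1   T1    8.0
    TEND1   T2   15.0
ENDATA
"""


def parse_stochastics(text):
    """Parse the sections of a stochastics file (whitespace-split sketch)."""
    data = {"name": None, "technology": [], "distribution": {},
            "objective": {}, "tenders": {}}
    section = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        tokens = raw.split()
        # Section headers start in column 1; data lines are indented.
        if not raw[0].isspace() and tokens[0] in SECTIONS:
            section = tokens[0]
            if section == "NAME":
                data["name"] = " ".join(tokens[1:])
            elif section == "ENDATA":
                break
            continue
        if section == "TECHNOLOGY":
            data["technology"].append(tokens[0])
        elif section == "DISTRIBUTION":
            row, value, prob = tokens[0], float(tokens[1]), float(tokens[2])
            data["distribution"].setdefault(row, []).append((value, prob))
        elif section == "OBJECTIVE":
            # first value: penalty on shortage; second: penalty on surplus
            data["objective"][tokens[0]] = (float(tokens[1]), float(tokens[2]))
        elif section == "TENDERS":
            name, row, value = tokens[0], tokens[1], float(tokens[2])
            data["tenders"].setdefault(name, {})[row] = value
    # The probabilities for each technology row must sum to unity.
    for row in data["technology"]:
        total = sum(p for _, p in data["distribution"].get(row, []))
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"probabilities for row {row} sum to {total}")
    return data


parsed = parse_stochastics(EXAMPLE)
```

A production reader would honour the column positions exactly, since MPS-style 
names may in principle contain blanks; the whitespace split is only adequate 
for simple names such as those used here. 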


For our example the “stochastics” file is given in Figure 14.3. 


NAME          TEST
TECHNOLOGY
    T1
    T2
DISTRIBUTION
    T1                  8.0                      0.25
    T1                  10.0                     0.5
    T1                  12.0                     0.25
    T2                  15.0                     0.2
    T2                  18.0                     0.4
    T2                  20.0                     0.4
OBJECTIVE
    T1                  2.0                      1.0
    T2                  2.0                      1.0
TENDERS
    TEND1     T1        8.0
    TEND1     T2        15.0
ENDATA


Figure 14.3 The stochastics file 

14.3.3 Control File 


The “control” file provides the information needed to guide the solution process. 
It gives: 
- algorithm selected (generalized linear programming, bounded variable al- 
gorithm, elastic constraints or linear programming) 


- input/output units for the files used by the system 

- dimensioning information for various arrays within the system 
- names of objective and right-hand-side vectors 

- additional control parameters e.g. output level, cycle limit, etc. 
- specification cards for MINOS 


Our design here is similar to the MINOS SPECS file, but our format spec- 
ification is more rigid and is based upon fields of four characters. Each main 
section is identified by a principal keyword which begins in column 1. Within 
each of these further options are identified by a second keyword which begins 
in column 5. Each of these options may have further suboptions and these are 
in turn identified by keywords beginning in column 9. The numerical strings or 
integers which provide the information that goes with a keyword are specified 
in a data field given by columns 23 through 30. Integers must of course be right 
justified. Only the first four characters (including blanks) of any keyword are 
significant. 

The principal keywords, i.e. the keywords beginning in column 1, must be 
specified even when all defaults are selected. 
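
The fixed-field layout just described can be sketched in Python as follows. 
This is an illustration of the column conventions only (keywords starting in 
columns 1, 5 or 9, data in columns 23 through 30, four significant keyword 
characters); the function names, the helper `card` used to build sample lines, 
and the nested-dictionary result are assumptions made for the example and do 
not describe the actual SPORT routines. 

```python
LEVEL_BY_INDENT = {0: 1, 4: 2, 8: 3}  # keyword starts in column 1, 5 or 9


def card(indent, keyword, data=""):
    """Build one sample line: keyword at `indent`, data right-justified in columns 23-30."""
    return (" " * indent + keyword).ljust(22) + str(data).rjust(8)


def split_control_line(line):
    """Return (level, four-character keyword, data string) for one line."""
    padded = line.rstrip("\n").ljust(30)
    indent = len(padded) - len(padded.lstrip())
    if indent not in LEVEL_BY_INDENT:
        raise ValueError(f"keyword must begin in column 1, 5 or 9: {line!r}")
    keyword = padded[indent:indent + 4]   # only four characters are significant
    data = padded[22:30].strip()          # columns 23 through 30
    return LEVEL_BY_INDENT[indent], keyword, data


def parse_control(lines):
    """Collect options under each principal keyword."""
    settings, section, option = {}, None, None
    for line in lines:
        if not line.strip():
            continue
        level, key, data = split_control_line(line)
        if level == 1:
            section, option = key, None
            settings[section] = {"value": data, "options": {}}
        elif section is None:
            raise ValueError("option found before any principal keyword")
        elif level == 2:
            option = key
            settings[section]["options"][key] = data
        else:
            # Suboptions are stored under "OPTION/SUBOPTION" to avoid clashes.
            settings[section]["options"][f"{option}/{key}"] = data
    return settings


sample = [
    card(0, "BEGIN"),
    card(0, "ALGORITHM", "ILSRDD"),
    card(0, "UNIT NUMBERS"),
    card(4, "CORE FILE", 10),
    card(4, "STOCHASTICS FILE", 11),
    card(0, "DIMENSIONS"),
    card(4, "ROWS", 10),
    card(4, "TENDERS"),
    card(8, "INPUT", 1),
    card(0, "CONTROL OPTIONS"),
    card(4, "CYCLE LIMIT", 8),
    card(0, "END"),
]
settings = parse_control(sample)
```

Because only the first four characters count, "CORE FILE" and "CORE" select 
the same option, and the keyword END is stored with its trailing blank. 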

The keywords are as follows: 


BEGIN This is a delimiter identifying the beginning of the control file 

ALGORITHM This identifies the selected algorithm. Options are ILSRDD, 
BVSRDD, ELASTIC or MINOS. 

UNIT NUMBERS The unit numbers are specified as follows: 


CORE unit number of “corefile”. Default = 5 

STOCHASTICS unit number of “stochastics” file. Default = 7 

SPECS unit number of the MINOS SPECS file. Default = 8 

MPS unit number of the MINOS file specifying the matrix. Default = 9 
DEBUG unit number for debugging information. Default = 0 (no output) 
LOG unit number of the log file. Default = 0 (no output) 


DIMENSIONS This specifies information for setting up the work array 

ELEMENTS an upper bound on the number of elements in the matrix 
(including space for input and generated tenders). Default = 1500 

ROWS an upper bound on the number of rows (including technology). 
Default = 100 

TECHNOLOGY an upper bound on the number of technology rows. De- 
fault = 20 

COLUMNS an upper bound on the number of columns in the matrix 
(including tenders). Default = 300 

PROBABILITIES an upper bound on the number of discrete levels associated 
with each stochastic right-hand side. Default = 30 


TENDERS This provides information on tenders as follows: 


INPUT an upper bound on the number specified in the “stochastics” 
file. Default = 1 

GENERATED an upper bound on the number of tenders saved. Used 
in the round robin strategy. Default = 20 

ELEMENTS an upper bound on the total number of tender elements. 
Default = 2000 


Note: One must be careful about specifying these quantities. 
SELECTORS 
OBJECTIVE name of the objective row—up to 8 characters (must be 
provided) 
RHS name of the right-hand-side vector—up to 8 characters (must be 
provided) 
BOUNDS name of the bounds vector—up to 8 characters 
RANGES name of the ranges vector—up to 8 characters 
CONTROL OPTIONS 
OUTPUT output level. Options are 1, 2 or 3, which provide increasingly 
verbose output. Default = 2 
CYCLE limit on number of tenders generated. Default = 1 
SCALE scale factor ρ (see (14.7)), expressed as a percentage 
(ρ = SCALE/100). Default = 100. 


MINOS SPECIFICATIONS Here one specifies any MINOS options which are 
then echoed into the MINOS SPECS file. 
END Delimiter indicating the end of the control section 


In our example the “control” file is given in Figure 14.4. 


14.3.4 Implementation of Problem Setup 

This is done using some modules from LPKIT (see Nazareth [17]) suitably 
modified to suit our purposes. Additional routines have been written to set up 
information specified in the “stochastics” file into packed data structures and 
to generate the MINOS SPECS and MPS files. 


14.4 Specialized Setup and Solution 


This part of the implementation is built around MINOS Version 5.0 whose 
outermost routines MINOS1 and MINOS2 were modified for our purposes. In 
particular, the PHANTOM COLUMNS option of MINOS (simply a device to 
provide some “elbow-room” in the data structures holding the problem) is ex- 
tensively used in order to complete the setup of the recourse problem in the 
packed data structures used by the MINOS system. 


BEGIN
ALGORITHM               ILSRDD
UNIT NUMBERS
    CORE FILE               10
    STOCHASTICS FILE        11
    SPECS FILE              12
    MPS FILE                13
    DEBUG FILE              14
    LOG FILE                14
DIMENSIONS
    ELEMENTS               700
    ROWS                    10
    COLUMNS                 40
    PROBABILITIES           20
    INPUT                    1
    GENERATED               1a
    ELEMENTS                 9
SELECTORS
    OBJECTIVE              OBJ
    RHS                    RTH
CONTROL OPTIONS
    OUTPUT                   2
    CYCLE LIMIT              8
    SCALE FACTOR           100
END


Figure 14.4 The control file 


14.4.1 ILSRDD 


The master program is defined by expression (3.7) in [21] with W ≡ [I, −I] and 
the obvious extension to match expression (14.7) in this paper. MINOS 5.0 sets 
up the A and T matrices in packed data structures from the MPS file which 
was generated in the previous phase. Then our modifications to subroutine MINOS2 
pack in the additional columns corresponding to tenders. Other routines 
developed by us, which are called within the subroutine MINOS2, implement 
the generalized linear programming algorithm in coordination with the solution 
of each master program by MINOS 5.0. The detailed algorithm is given in [18]. 
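
The detailed algorithm is in [18] and is not reproduced here. Purely as an 
illustration of why generating a new tender is cheap for simple recourse, the 
Python sketch below shows a generic pricing step of a generalized programming 
scheme: given a price $\pi_i$ on each technology row, it picks, row by row, the 
tender component minimizing $\Psi_i(\chi_i) - \pi_i \chi_i$. The sign convention 
for the prices, the function names and the boundedness check are assumptions 
made for this example; they are not taken from [18] or from the SPORT code. 

```python
def expected_penalty(chi, levels, probs, q_plus, q_minus):
    """Psi_i(chi): expected shortage/surplus penalty for one row."""
    return sum(
        p * (q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0))
        for h, p in zip(levels, probs)
    )


def price_out_component(price, levels, probs, q_plus, q_minus):
    """Return the demand level minimizing Psi_i(chi) - price * chi.

    ASSUMPTION: the master dual enters with this sign; [18] may differ.
    """
    # The slopes of Psi_i range from -q_plus to q_minus, so outside this
    # interval the reduced function is unbounded below.
    if not -q_plus <= price <= q_minus:
        raise ValueError("price outside [-q+, q-]: subproblem is unbounded")
    # A convex piecewise-linear function attains its minimum at a breakpoint,
    # and the breakpoints of Psi_i are exactly the demand levels.
    return min(
        levels,
        key=lambda h: expected_penalty(h, levels, probs, q_plus, q_minus) - price * h,
    )


# Product-mix data of Section 14.3 with illustrative (made-up) prices.
rows = {
    "T1": ([8.0, 10.0, 12.0], [0.25, 0.5, 0.25]),
    "T2": ([15.0, 18.0, 20.0], [0.2, 0.4, 0.4]),
}
prices = {"T1": 0.0, "T2": 0.0}
new_tender = {
    row: price_out_component(prices[row], levels, probs, 2.0, 1.0)
    for row, (levels, probs) in rows.items()
}
```

Separability is what keeps this step trivial: each component is found by 
scanning a handful of breakpoints, with no linear program to solve. 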
14.4.2 BVSRDD 


This is an implementation of the bounded variable method of Wets [25] in the 
form given in [21], Section 2.1. Further details of the algorithm may be found 
in [20]. There is a danger of performing a large number of pivot operations 
when the probability distribution of each right-hand-side element has many 
points (the so-called epsilon-to-death problem) but the associated computa- 
tional effort is alleviated by the way in which MINOS updates its basis matrix 
representation. It is possible to improve the implementation (a) by using some 
of the acceleration techniques discussed in Wets [25] which, in effect, carry out 
several basis changes at the same time, (b) by specifying a good starting basis 
from the special structure in (14.7). 

In contrast to ILSRDD, implementation is much more straightforward be- 
cause only an initial linear program must be set up. 
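
One standard way to express a separable piecewise-linear convex function with 
bounded variables is to give each interval between consecutive demand levels 
its own variable, bounded by the interval width and priced at the slope of 
$\Psi_i$ on that interval. The Python sketch below illustrates that idea for a 
single row; it is not claimed to be the exact formulation of [25] or [20], and 
the function names are hypothetical. It also hints at the pivoting issue noted 
above, since a distribution with many points produces many such variables. 

```python
def psi_direct(chi, levels, probs, q_plus, q_minus):
    """Expected shortage/surplus penalty from the definition."""
    return sum(
        p * (q_plus * max(h - chi, 0.0) + q_minus * max(chi - h, 0.0))
        for h, p in zip(levels, probs)
    )


def psi_bounded_variables(chi, levels, probs, q_plus, q_minus):
    """Evaluate Psi_i(chi) by filling interval variables in order."""
    q = q_plus + q_minus
    # Value at the lowest demand level: only shortage can occur there.
    total = sum(p * q_plus * (h - levels[0]) for h, p in zip(levels, probs))
    if chi < levels[0]:
        # Below the lowest level the slope is -q_plus.
        return total + q_plus * (levels[0] - chi)
    remaining = chi - levels[0]
    cum_p = 0.0
    for l in range(len(levels)):
        cum_p += probs[l]
        slope = cum_p * q - q_plus  # slope on [h_l, h_{l+1}]
        last = l + 1 == len(levels)
        width = float("inf") if last else levels[l + 1] - levels[l]
        step = min(remaining, width)  # bounded variable: 0 <= step <= width
        total += slope * step
        remaining -= step
        if remaining <= 0.0:
            break
    return total


levels, probs = [8.0, 10.0, 12.0], [0.25, 0.5, 0.25]
checks = [
    (chi,
     psi_bounded_variables(chi, levels, probs, 2.0, 1.0),
     psi_direct(chi, levels, probs, 2.0, 1.0))
    for chi in (6.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0)
]
```

Because the slopes increase from one interval to the next, a minimizing linear 
program fills the interval variables in the same order as the loop does, so no 
explicit ordering constraints are needed. 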


14.4.3 ELASTIC 


This option implements the linear program (14.2) (see Section 14.2.1 of this 
paper), thereby permitting the “technology rows” to be elastic. The row names 
defining the technology rows and the penalties $q^+$ and $q^-$ are defined by the 
stochastics file. Other data in this file is ignored. 


14.4.4 MINOS 


This simply provides the preliminary option of solving an initial linear program. 
The data in the stochastics file is not required here. 


14.5 Output Phase 
The output consists of two parts: 


(a) MINOS output in standard MPS format. For a description of this see 
Murtagh & Saunders [16]. 

(b) SPORT output. This gives the first-stage and second-stage costs, the optimal 
tender, the dual multipliers (prices) associated with the technology 
rows in the optimal solution and the probability levels of the equivalent 
chance-constrained program. 


For the earlier example the output is given in Figure 14.5. 

14.6 Testing 

The program has been exercised on several test problems as follows: 

(a) The product-mix example of Section 14.3 due to J. Ho. This is a “toy” 
problem with 5 rows of which 2 are technology rows and 6 first-stage deci- 
sion variables. 

(b) The test problem given by Kallberg & Kusy [11]. This too is a “toy” problem 
with 3 rows of which 2 are technology rows and 6 first-stage decision 
variables. (Documented in King [13].) 

(c) The test problem given by Cleef [8]. This has 9 rows of which 6 are 
technology rows and 16 first-stage decision variables. (Documented in King 
[13].) 

(d) The problem of allocating aircraft to routes given in Dantzig [4]. This has 
9 rows of which 5 are technology rows and 29 first-stage decision variables. 
(Documented in King [13].) 

(e) A discretized version of the stochastic transportation problem given by 
Qi [23] formulated as a standard stochastic linear program with simple 
recourse. This has 78 rows of which 44 are technology rows and 1496 
first-stage decision variables. 

The bank asset and liability model given by Kusy & Ziemba [13] and a 
full-scale version of problem (d) above both provide good illustrations of the 
practical applications for which our program is designed. 

14.7 Sportsmanship 

The current system can be applied to a wider range of problems than would 
appear at first sight. For example, when the stochastic linear program has 
stochastic technology matrices with a few discrete probability levels (which are 
independent of the right-hand-side distribution), say $T_1, \ldots, T_t$ with 
probabilities $p_1, \ldots, p_t$, then we can express this as an equivalent problem 

$$
\begin{array}{ll}
\text{minimize}   & cx + p_1 q^+ y_1^+ + p_1 q^- y_1^- + \cdots + p_t q^+ y_t^+ + p_t q^- y_t^- \\
\text{subject to} & Ax = b \\
                  & T_1 x + [I, -I] \begin{pmatrix} y_1^+ \\ y_1^- \end{pmatrix} = h(\omega) \\
                  & \qquad \vdots \\
                  & T_t x + [I, -I] \begin{pmatrix} y_t^+ \\ y_t^- \end{pmatrix} = h(\omega)
\end{array}
\tag{14.8}
$$

Let us treat T defined by 

$$
T = \begin{pmatrix} T_1 \\ \vdots \\ T_t \end{pmatrix}
$$

as a technology matrix in the usual way. Then we can set up the problem 
so that it can be solved by the system, as described earlier, with appropriate 
definition of penalties and distribution determined by (14.8). 

In some situations the underlying probability distribution of h(·) is only 
known implicitly through a simulation model involving the random elements ω. 
Nazareth [18] discusses how the system can be extended to this case (see, in 
particular, Section 3.2 of [18] for some numerical experiments). 

When the probability distribution of h(·) is not discrete, SPORT 2.0 can 
be used in conjunction with some iterative discretization procedure and com- 
putation of error bounds (see, for example, [26]). 

When a more complex penalty structure is imposed on the second stage, 
program modifications would be required. This could, in many cases, be done 
fairly easily. 


14.8 Availability 


The Fortran implementation described here, SPORT 2.0 (pronounced Sup- 
PORT Version 2.0) was developed for use at IIASA on the VAX 11/780 (under 
the UNIX operating system). It uses MINOS 5.0 (the latest documented ver- 
sion), which is available in-house. Using the terminology in Nazareth [19], the 
current version of our system is a level-2 implementation, designed for algo- 
rithmic experimentation and for problem solving by an experienced user (one 
expected to be familiar both with his problem and with the implemented algo- 
rithm). 

To use SPORT 2.0 at another site, it would be necessary to obtain MINOS 
5.0 independently from Stanford University and to substitute our set of Fortran 
routines for the two MINOS 5.0 files MI00MAIN and MI10MACH. (Note that 
SPORT 2.0 will not run with versions of MINOS below 5.0.) 

An earlier version of our system, SPORT 1.1, designed for MINOS 4.9, 
is available on the SDS/ADO tape, which is a collection of a number of routines 
for stochastic programming. This version provides readable Fortran and 
a manual (see Edwards [6]) to document our implementation. Note that it is 
not executable, since MINOS 4.9 is not included with it. 

In order to obtain a copy of SPORT 2.0, please contact the author of 
this article at either of the following addresses: IIASA, System and Decision 
Sciences, A-2361, Laxenburg, Austria or CDSS, P.O. Box 4908, Berkeley, 
California 94704, USA. 

14.9 Stochastic Programming with Recourse as a Form of Post-optimal 
Analysis in a Mathematical Programming System 


Many large-scale Mathematical Programming Systems (e.g., MPSX/370 [1]) 
provide options for performing parametric and sensitivity analysis in the optimal 
solution of a linear program and for repeated (and efficient) reoptimization 
through a dual simplex procedure, when the right-hand-side is changed. (For 
MINOS, post-optimal analysis routines have been developed by Dobrowski, et 
al. [5].) 

A common approach for handling uncertainty in the right-hand-side is to 
use scenario analysis, which is indeed greatly facilitated by the above post- 
optimal options. Ermoliev and Wets [8] characterize this approach to dealing 
with uncertainty as being “seriously flawed” and explain why as follows: 
“Although it (scenario analysis) can identify ‘optimal’ solutions for each scenario 
(that specifies some values for the unknown parameters), it does not provide any 
clue as to how these ‘optimal’ solutions should be combined to produce a merely 
reasonable decision.” Another approach that has been utilized by mathematical 
programmers as discussed in Section 14.2.1 is to introduce elastic constraints 
by defining penalties on shortage and surplus for a given right-hand-side. This, 
as we have noted, is in the spirit of the recourse model, but it does not yet 
address the stochastic aspect of the right-hand-side elements. 

One aim of our paper has been to demonstrate (hopefully convincingly) 
that recourse analysis could be introduced in a very natural way as a post- 
optimal analysis option in an MPS and that its implementation is not substan- 
tially more difficult than that of other post-optimal analysis options currently 
provided within them. It could be argued, of course, since problem (14.7) can 
be directly expressed as a linear program, that it could be left up to the user 
to set up this linear program, create the appropriate MPS file and solve it in 
the conventional way. This is to impose upon him or her a laborious and er- 
ror prone task. To do so would be as unreasonable as requiring that the user 
implement his own post-optimal parametric and sensitivity analysis. Another 
approach is to use an extended LP system based upon piecewise-linear (separa- 
ble) programming (see Fourer [9]) to solve (14.5) or (14.7). Unfortunately such 
systems are not available as general purpose software. Thus it is necessary to 
fall back upon the more conventional mathematical programming systems. 

The particular implementation described in earlier sections of this paper 
was developed for MINOS (specifically Version 5.0) in its linear programming 
mode, but an implementation for another large-scale linear programming sys- 
tem (MPS) could be patterned along rather similar lines (see, in particular, 
Figure 14.1). This would require the following: 

(a) Firstly, augmentation of the standard MPS description of a linear program 
(which may be formulated and solved as a first step) by some standardized 
description of the stochastic information. A format along similar lines to 
Section 14.3.2 would be quite appropriate. Note that this does not conflict 
with the trend toward high-level modeling systems for defining mathematical 
programming problems (see, for example, the GAMS System of 
Brooke, et al. [2]). MPS format (and its extension to stochastic problems) 
primarily serves the purpose of formalizing the interface to optimization 
codes and indeed MPS format continues to play this role in systems like 
GAMS. (With regard to the third “control” file of Figure 14.1, note that 
this is specific to the MINOS implementation and would obviously vary 
with different MPS systems.) 


(b) Secondly, set up of one or more linear programming problems corresponding 
to (14.7) by augmenting internal data structures. The more straightforward 
implementation (because it involves only one augmentation) is to use 
some version of the bounded variable method of Wets [25] as in BVSRDD 
(see Section 14.4.2.). Assuming that a deterministic version of the problem 
has already been solved, the additional columns could be inserted directly 
into the packed data representation used by the MPS from the stochastic 
information supplied as described in (a) above, and the problem reopti- 
mized. (It would be wasteful to generate a fresh MPS file for (14.7).) In 
MPSX/370, the augmentation and reoptimization could be done through 
the Extended Control Language (see [1]). The difficulty with the bounded 
variable approach arises when the distribution has many points, for ex- 
ample, when it is obtained by discretizing a continuous distribution. See 
the discussion in Section 14.4.2. Also it does not generalize to nonsimple 
recourse. The alternative is to implement the generalized linear program- 
ming approach, again directly inserting the added columns into internal 
data structures and solving a sequence of linear programs, each starting 
off where the previous one left off (as in ILSRDD, Section 14.4.1). As 
we have seen, implementation required modification only of the outermost 
level of MINOS and we believe this would be true for other MPS systems 
as well. The ILSRDD algorithm is very efficient in this context and as we 
may note, the approach applies to more general forms of recourse. 


(c) Thirdly, the output of the solution in an appropriate way, again done most 
conveniently through access to the internal data structure. 


To summarize, the mathematical programming field is ripe for incorporating 
some forms of stochastic programming with recourse into current large-scale 
MPS systems. We have provided a detailed illustration of how it can be done 
for one currently available MPS and how it could (possibly even should) be 
done for other systems. 
CHAPTER 15 


AN IMPLEMENTATION OF THE LAGRANGIAN 
FINITE-GENERATION METHOD 


A.J. King 


15.1 Introduction 

An experimental code of the Lagrangian finite generation technique has been 
developed at IIASA for solving stochastic quadratic programs with simple re- 
course [1]: 


find $x \in R^n$ to maximize: 


(SQP)   [display equation not recoverable from the source] 

where $\theta$ is a piecewise linear-quadratic function given by: 

[definition of $\theta$ not recoverable from the source] 

The quantities $t_{ij}$, $h_i$ are square summable random variables and the other 
coefficients are fixed (nonstochastic) with $d_j > 0$ and $e_i > 0$. 

The algorithm generates a sequence of points $\{\bar x^\mu, \mu = 1, \ldots\}$ that converge 
at a linear rate to the optimal solution, by solving at each step a modified version 
(SQP$_\mu$) of the original problem obtained by adding to the objective of (SQP) 
a proximal term [2]. More precisely we modify (SQP) by changing the linear 
and quadratic coefficients as follows 

$$c_j^\mu = c_j + s_\mu^{-1} \bar x_j^\mu, \qquad j = 1, \ldots, n$$
$$d_j^\mu = d_j + s_\mu^{-1}, \qquad j = 1, \ldots, n$$

which has the effect of adding the proximal term $-\tfrac{1}{2} s_\mu^{-1} \|x - \bar x^\mu\|^2$ to the 
objective (up to an additive constant). 
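To solve (SQP$_\mu$) we proceed by way of the dual [1], [8]: 


find $z \in L^\infty(\Omega, \mathcal{F}, P; R^\ell)$, $y \in R^m$ to minimize: 

$$\sum_{i=1}^{m} y_i b_i + \sum_{i=1}^{\ell} E\{z_i h_i\} + \tfrac{1}{2} \sum_{i=1}^{\ell} E\{e_i z_i^2\}$$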
The properties of this problem (the appearance of the integrals in the objective 
and constraints, and the simple nature of the boundary constraints on $z$) permit us 
to solve this dual problem by a finite generation technique whereby we replace 
minimization over $Z = \prod_i [-q_i^-, q_i^+]$ with minimization over the convex set 
generated by a certain collection of elements $Z^\nu = \{\xi^1, \ldots, \xi^{N_\nu}\}$, which turns 
out to be ordinary quadratic programming. We then use the information gained 
by solving DQP$_\mu$ over co $Z^\nu$ to generate a new collection $Z^{\nu+1}$, and in this way 
obtain a sequence $\{\bar x^\nu, \nu = 1, \ldots\}$ (the dual variables to DQP$_\mu$) which converge 
at a linear rate to the optimal solution of SQP$_\mu$ [1]. 

The Lagrangian finite generation method requires that the random quantities 
$(h,t)$ have finite discrete support. Of course this is not a restriction in the 
sense that some sort of discrete approximation scheme is needed in any case to 
carry out the integrations. Discretization of measures for the solution of stochastic 
optimization problems is currently a very active research area. We will describe 
some of the work in this direction below in Section 15.6.2. 
Figure 15.1 Graph of $e_i \theta(q_i^-, q_i^+; e_i^{-1} v_i)$ 


15.2 Discussion of SQP as a Model for Recourse Problems 


The most important feature of SQP is the recourse penalty term for the first 
stage variables which takes the following form: 

This can be viewed as a linear recourse penalty with a quadratic transition 
and is a generalization of the function ¢;9;9(e;'v;) in [8]. The role of the 
piecewise linear quadratic penalty in the problem SQP is identical to that of 
the piecewise linear penalty in the stochastic linear program with recourse. 

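The usual statement of the stochastic linear program with simple recourse 
is as follows [4]: 


choose $x \in R^n$ to maximize: 

$$\sum_{j=1}^{n} c_j x_j - \sum_{i=1}^{\ell} E\{q_i^+ v_i^+ + q_i^- v_i^-\}$$

subject to 

$$-r_j^- \le x_j \le r_j^+, \qquad j = 1, \ldots, n$$

(SLP) $$\sum_{j=1}^{n} a_{ij} x_j \le b_i, \qquad i = 1, \ldots, m$$

$$v_i^+ - v_i^- = \sum_{j=1}^{n} t_{ij} x_j - h_i, \qquad i = 1, \ldots, \ell$$

$$v_i^+ \ge 0, \; v_i^- \ge 0 \text{ a.s.}, \qquad i = 1, \ldots, \ell$$


With this formulation it is easy to see that if we take $d_j = 0$ and write down 
the limiting version of SQP as $e_i \to 0$, then we obtain SLP. 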
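
The limit can be checked numerically. The sketch below is illustrative only and rests on an assumption: $\theta$ is taken to be the antiderivative, vanishing at 0, of the clipped-linear derivative $\theta'$ given in Section 15.4, so that it is quadratic on $[-q^-, q^+]$ and linear outside. The function names are hypothetical and are not those of the IIASA code. 

```python
def theta(q_minus, q_plus, tau):
    """Assumed piecewise linear-quadratic penalty.

    Quadratic on [-q_minus, q_plus]; continued linearly with slopes
    -q_minus (left) and q_plus (right) so that the function is C^1.
    """
    if tau < -q_minus:
        return -q_minus * tau - 0.5 * q_minus ** 2
    if tau > q_plus:
        return q_plus * tau - 0.5 * q_plus ** 2
    return 0.5 * tau ** 2


def sqp_penalty(e, q_minus, q_plus, v):
    """Recourse penalty e * theta(q-, q+; v / e) of the quadratic model."""
    return e * theta(q_minus, q_plus, v / e)


def slp_penalty(q_minus, q_plus, v):
    """Piecewise linear penalty q+ * v+ + q- * v- of the linear model."""
    return q_plus * max(v, 0.0) + q_minus * max(-v, 0.0)


if __name__ == "__main__":
    q_minus, q_plus = 1.0, 3.0
    for v in (-2.0, 0.5, 4.0):
        for e in (1.0, 0.1, 0.001):
            print(v, e, sqp_penalty(e, q_minus, q_plus, v),
                  slp_penalty(q_minus, q_plus, v))
```

Under this assumed form the two penalties differ by at most $\tfrac{1}{2} e \max(q^-, q^+)^2$, which vanishes as $e \to 0$. 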

The reader should note that it is possible to solve SLP with the present 
algorithm, but in general the rate of convergence will not be linear. It is also 
possible to specify any value whatsoever to $q_i^-$, $q_i^+$, providing of course that 
$-q_i^- < q_i^+$. An important special case is the setting $q_i^- = 0$, $q_i^+ > 0$ giving a 
linear quadratic penalty of the form: 


Figure 15.2 Graph of $e_i \theta(0, q_i^+; e_i^{-1} v_i)$ 


This type of penalty finds application in a variety of resource management 
problems where the concern is to achieve $E\{v_i\} \le 0$ and simultaneously 
to reduce the variance of the $v_i$ above 0. The flexibility of the linear-quadratic 
penalty allows the decision maker to find, through the process of adjusting 
the various parameters, decision vectors $\bar x$ giving second stage outcomes 
$v_i$ with certain desirable combinations of expectation and variance. An application 
of this model to the Lake Balaton eutrophication problem is discussed 
elsewhere in this volume [9]. A second important case is where one or both of 
$q_i^-$, $q_i^+$ are infinite. This would give a purely quadratic recourse penalty in the 
appropriate direction, i.e., a one or two sided least-squares problem. Of course 
the same comments apply to $r_j^-$ and $r_j^+$. We treat the case where $q_i^- = q_i^+ > 0$ 
separately in the next section as an important potential application of these to 
a numerical optimization problem in statistics. 
15.3 Application of SQP to Robust Statistical Estimation 


An important problem in applied statistics is to estimate the parameters $x$ of 
a linear model $y = Tx$, where $T$ is a given (possibly stochastic) matrix, from 
observations $\{y_k, k = 1, \ldots, N\}$ which may be distorted by a noise term of 
unknown distribution. One formulation of this problem is the least squares 
model as originally proposed by Gauss: 


choose $x$ to minimize: 


(LS) $$\frac{1}{N} \sum_{k=1}^{N} |y_k - Tx|^2.$$


Robust estimation in general is concerned with techniques of assessing and 
reducing the influence of the given set of observations upon the estimation of 
the parameter z (cf. [6]). One such technique is to modify the LS problem, 
reducing the influence of outliers in the sample by the use of the function: 


$$\rho(t) = \begin{cases} \tfrac{1}{2} t^2 & \text{if } |t| \le 1 \\ |t| - \tfrac{1}{2} & \text{if } |t| > 1 \end{cases}$$


giving the model (robust least squares): 


choose $(x, \sigma)$ to minimize: 


N 
(RLS) Ae —Tz)| 


Except for the appearance of the $\sigma$ as one of the variables involved in the 
minimization, the problem RLS can easily be seen as a particular interpretation 
of the model SQP, where $\sigma$ corresponds to the $e_i$, we take $q_i^- = q_i^+ = 1$, and 
set $c_j = d_j = 0$. (In practice, $\sigma$, which is called the “nuisance parameter” for 
obvious reasons, is usually held fixed in the solution of RLS.) 
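
As a small illustration of why such a penalty damps outliers, the sketch below compares an ordinary least squares objective with a robust one. It makes two assumptions that are not confirmed by the text: $\rho$ is taken to be Huber's function (quadratic on $[-1, 1]$, linear outside), and each residual is scaled as $\sigma \rho(\cdot / \sigma)$ by analogy with the $e_i \theta(\cdot\,; e_i^{-1} v_i)$ term of SQP. The scalar model $y \approx t x$ and all function names are hypothetical. 

```python
def huber_rho(t):
    """Huber's function (assumed form of rho): quadratic near 0, linear in the tails."""
    a = abs(t)
    return 0.5 * t * t if a <= 1.0 else a - 0.5


def robust_objective(x, sigma, t_values, y_values):
    """Average of sigma * rho(residual / sigma) for the scalar model y ~ t * x."""
    n = len(y_values)
    return sum(sigma * huber_rho((y - t * x) / sigma)
               for t, y in zip(t_values, y_values)) / n


def least_squares_objective(x, t_values, y_values):
    """Average squared residual for the same scalar model."""
    n = len(y_values)
    return sum((y - t * x) ** 2 for t, y in zip(t_values, y_values)) / n


if __name__ == "__main__":
    t_values = [1.0, 1.0]
    y_values = [0.0, 10.0]  # the second observation plays the role of an outlier
    print(least_squares_objective(0.0, t_values, y_values))
    print(robust_objective(0.0, 1.0, t_values, y_values))
```

The outlier enters the least squares objective quadratically but the robust objective only linearly, which is the effect the modification is meant to achieve. 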

Thus we can derive in an analogous fashion to Section 15.1 a sequence of 
problems RLS$_\mu$ whose dual: 


choose $z \in R^N$ to minimize: 


N N n 

1 1 a 

W ) aye boy y one + y de O(r; st; 3 w;/ de) 
k=1 k=1 j=l 


subject to -l1<ze<1 k=1,...,N 


N 
1 : 
ae zal; g=l,...,n 
=1 


can be solved by the finite generation technique. In this formulation we have 
introduced primal constraints of the form $-r_j^- \le x_j \le r_j^+$ but this is not a 
serious problem since $x_j$ can always be taken to be bounded in practice. As an 
additional feature, of course, we can include linear inequality constraints into 
the model RLS if it is required. 


15.4 The Lagrangian Finite Generation Technique 

In the paper [1], the authors develop a technique for solving a class of stochastic 
quadratic programs of which SQP$_\mu$ is a special example. The key idea is to 
approximate the dual problem DQP$_\mu$ by a sequence of quadratic subproblems 
which correspond to minimizing the dual objective over the convex hull of 
finitely many dual feasible solutions. The technique can be summarized in the 
following way: 


1. find $(\bar x^\nu, \bar z^\nu)$ saddlepoint of $L^\mu(x, z)$ over $X \times \mathrm{co}\, Z^\nu$ 
2. find $z^\nu \in \mathrm{argmin}_{z \in Z} L^\mu(\bar x^\nu, z)$ 
3. determine $Z^{\nu+1} = \{\xi_1, \ldots, \xi_{N_{\nu+1}}\} \supset \{\bar z^\nu, z^\nu\}$, return to step 1 with 
   $\nu = \nu + 1$ 

where $L^\mu$ is the augmented Lagrangian associated with the primal-dual pair 
{SQP$_\mu$, DQP$_\mu$}. 
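The saddlepoint in step 1 is found by solving DQP$_\mu$ over the convex polytope 
co $Z^\nu$: 


choose $\lambda \in R^{\ell N_\nu}$, $y \in R^m$ to minimize: 

$$\sum_{i=1}^{m} y_i b_i + \sum_{i=1}^{\ell} \sum_{k=1}^{N_\nu} \lambda_i^k \,\mathrm{hexp}_i^k + \tfrac{1}{2} \sum_{i=1}^{\ell} \sum_{k=1}^{N_\nu} \sum_{k'=1}^{N_\nu} \lambda_i^k \lambda_i^{k'} \,\mathrm{gexp}_i^{kk'} + \sum_{j=1}^{n} \Big[ r_j^+ w_j^+ + r_j^- w_j^- + \tfrac{1}{2} (d_j^\mu)^{-1} (w_j)^2 \Big]$$

subject to 

$$\lambda_i^k \ge 0, \qquad i = 1, \ldots, \ell, \quad k = 1, \ldots, N_\nu$$

$$\sum_{k=1}^{N_\nu} \lambda_i^k = 1, \qquad i = 1, \ldots, \ell$$

(QP) $$y_i \ge 0, \qquad i = 1, \ldots, m$$

$$w_j + w_j^+ - w_j^- = c_j^\mu - \sum_{i=1}^{m} y_i a_{ij} - \sum_{i=1}^{\ell} \sum_{k=1}^{N_\nu} \lambda_i^k \,\mathrm{texp}_{ij}^k, \qquad j = 1, \ldots, n$$

$$w_j^+ \ge 0, \; w_j^- \ge 0, \qquad j = 1, \ldots, n$$

where 

$$\mathrm{hexp}_i^k = E\{h_i \xi_i^k\}, \qquad i = 1, \ldots, \ell, \quad k = 1, \ldots, N_\nu$$
$$\mathrm{texp}_{ij}^k = E\{\xi_i^k t_{ij}\}, \qquad i = 1, \ldots, \ell, \quad k = 1, \ldots, N_\nu$$
$$\mathrm{gexp}_i^{kk'} = E\{e_i \xi_i^k \xi_i^{k'}\}, \qquad i = 1, \ldots, \ell, \quad k, k' = 1, \ldots, N_\nu$$


This is ordinary quadratic programming which can be solved by any one of 
a number of reliable codes, for example MINOS [6]. The dual multipliers for 
the $n$ equality constraints give the primal decision vector $\bar x^\nu$, and the element 
$\bar z_i^\nu = \sum_k \lambda_i^k \xi_i^k$ gives the dual half of the saddlepoint for step 1. 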
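The minimization of the second step (over the whole space $Z$) can be 
obtained in closed form: 

$$z_i^\nu = \theta'\Big(q_i^-, q_i^+; e_i^{-1} \Big(\sum_{j=1}^{n} t_{ij} \bar x_j^\nu - h_i\Big)\Big)$$

where $\theta'$ is the derivative of the piecewise linear-quadratic function $\theta$, 

$$\theta'(q^-, q^+; \tau) = \begin{cases} -q^- & \text{if } \tau \le -q^- \\ \tau & \text{if } -q^- \le \tau \le q^+ \\ q^+ & \text{if } \tau \ge q^+. \end{cases}$$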
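
The closed form amounts to clipping the scaled second-stage residual to the interval $[-q^-, q^+]$. A minimal sketch for one stochastic constraint and one outcome of the random data follows; the function and argument names are hypothetical, and the sign convention $v = \sum_j t_j x_j - h$ is taken from the recourse constraints of SLP. 

```python
def theta_prime(q_minus, q_plus, tau):
    """Derivative of the penalty: tau clipped to [-q_minus, q_plus]."""
    return max(-q_minus, min(q_plus, tau))


def dual_element(x, t_row, h, e, q_minus, q_plus):
    """Closed-form second-step element for one constraint and one outcome.

    x      : current primal point (list of floats)
    t_row  : realized row of the t matrix for this constraint
    h      : realized right-hand side
    e      : quadratic parameter e_i > 0
    """
    v = sum(t_ij * x_j for t_ij, x_j in zip(t_row, x)) - h
    return theta_prime(q_minus, q_plus, v / e)


if __name__ == "__main__":
    x = [1.0, 2.0]
    t_row = [0.5, 0.25]
    print(dual_element(x, t_row, 0.5, 1.0, 1.0, 1.0))  # residual inside the interval
    print(dual_element(x, t_row, 0.4, 0.2, 0.0, 1.0))  # clipped at the upper bound
```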


There are a number of ways to generate the set $Z^{\nu+1}$ in step 3 [1, section 
3]. In the implementation at IIASA we set 


y Amn =7° gear de ees 


where v is a predetermined maximum number of finite elements and $Z^0$ is 
some initial fixed set. The finite generation method can be interpreted as a 
cutting plane technique, so although we are only required in theory to set 
$Z^{\nu+1} = \{\bar z^\nu, z^\nu\}$ one expects and in fact one obtains better results if more 
“cuts” are included. 

Following this intuitive line of reasoning, we note also that it would be 
advantageous to include the element $\bar z^\mu$ obtained by solving the preceding 
augmented Lagrangian $L^{\mu-1}$ whose saddlepoint is denoted by $(\bar x^\mu, \bar z^\mu)$. The effect 
of including the element $\bar z^\mu$ in the initial set $Z^0$ for at least the first few 
iterations of the finite generation method is quite dramatic as can be seen in the 
following example. This is a product mix problem in the form: 


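$$\text{maximize} \quad \sum_{j=1}^{4} c_j x_j - \sum_{i=1}^{2} E\{e_i \theta(0, q_i^+; e_i^{-1} v_i)\}$$

$$\text{subject to} \quad 0 \le x_j \le r_j, \qquad j = 1, \ldots, 4$$

$$v_i = \sum_{j=1}^{4} t_{ij} x_j - h_i, \qquad i = 1, 2.$$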

The $t$ matrix entries are all independent uniform and the $h$ vector entries are 
independent normal. (For details see the Product Mix Problem [10].) In this 
example the algorithm halts when the relative duality gap, the normalized 
difference between the dual and primal, is less than 107°. 
Example 1: Product mix problems 


outer loop step        number of inner loop        duality gap 
                       iterations (QP's) 
p=l1 4 0.8 x 107 
p=2 (a) with 2 0.7 x 107? 
(b) without 7 2 0.5 x 10~? 
pas (a) with z 2 0.5 x 10-8 
(b) without 2 0.5 x 10~? 
w=d4 (a) with a =r 
(b) without 7 2 0.5 x 10°? 


15.5 Implementation 
In this section we describe the details which transform the theory of the previous 
section into a practicable numerical method. The computer implementation of 
the finite generation method at IIASA is presently in an experimental and 
developmental stage. We shall keep the discussion focussed on the algorithm 
itself and the important numerical details which must be considered in any 
implementation. 

Here is a rough outline of the complete algorithm. Details are described in 
the discussion which follows. 


LAGRANGIAN FINITE GENERATION ALGORITHM 
0. Initialization 


Read in data A,b,c,d,e,g° 97,4 er tsn,m, & 
Choose a 
Set p=l1 
1. Outer Loop (Augmented Lagrangian cycle) 
Set $s_\mu = \sigma s_{\mu-1}$ 
Set $d^\mu = d + s_\mu^{-1}$, $c^\mu = c + s_\mu^{-1} \bar x^\mu$ 
Determine $Z^1 = \{\xi_1, \ldots, \xi_{N_1}\}$ and $\bar z^\mu \in Z^1$ 
Set $\nu = 1$ 
gi a7 


2. Inner Loop (Finite generation method) 


(a) Calculate $z_i^\nu = \theta'\big(q_i^-, q_i^+; e_i^{-1} \big[\sum_j t_{ij} \bar x_j^\nu - h_i\big]\big)$, 
$i = 1, \ldots, \ell$ 


Put Be LY 
For each element $\xi^k \in Z^\nu$, $k = 1, \ldots, N_\nu$ calculate: 


hexp(i,k) = $E\{h_i \xi_i^k\}$, $i = 1, \ldots, \ell$, $k = 1, \ldots, N_\nu$ 
texp(i,j,k) = $E\{t_{ij} \xi_i^k\}$, $i = 1, \ldots, \ell$ 
gexp(i,k,k') = $E\{e_i \xi_i^k \xi_i^{k'}\}$, $i = 1, \ldots, \ell$, $k' = 1, \ldots, N_\nu$ 


(b) Solve (QP) using MINOS; obtain solutions: Le 
me A 


Set 2 ME 
= 2% 
3. Stopping Criteria 
(a) test <: 
If inner loop not converged 
Determine Zt! se eget 
Set gti = ev 


Return to 2. with y =v+1 


If inner loop converged 


Set get) — ge 
ee 
gel _ gw 


(b) test z#*+? 
If outer loop not converged, return to step 1 with p = p+1 
If outer loop converged, then stop. 


Step 0. The initial guess $\bar x^1$ serves merely to provide a starting point for 
the augmented Lagrangian procedure. One could just as well set $\bar x^1 = 0$. In 
the current implementation, $\bar x^1$ is the solution of the deterministic quadratic 
program obtained when the random variables $(h,t)$ are replaced by their expected 
values in the problem (SQP). The initial “guess” $\bar z^1$ is included only for 
reasons of symmetry; the current implementation ignores it. However $\bar z^1$ will 
become important when the algorithm utilizes the full primal-dual augmented 
Lagrangian, as discussed below under “future developments”. 

Step 1. The primary purpose of this step is to update the augmented 
Lagrangian: 

$$L^\mu(x, z) = L(x, z) - \frac{1}{2 s_\mu} \|x - \bar x^\mu\|^2$$

The factor $\sigma$ used to update $s_\mu$ is usually set between 1 and 2; for theoretical 
reasons we need $\sigma \ge 1$. 

Step 2. This is the finite generation method. The method consists of two 
optimization problems 


(1.) find saddlepoint $(\bar x^\nu, \bar z^\nu)$ of $L^\mu(x, z)$ over $X \times \mathrm{co}\, Z^\nu$ 
(2.) find $z^\nu \in \mathrm{argmin}_{z \in Z} L^\mu(\bar x^\nu, z)$. 


We have apparently inverted the order of the optimizations. In fact the 
present arrangement just serves to calculate the initial finite element $z^1$ from 
$\bar x^1$ without unnecessary duplication of codes. The second minimization (2) is 
achieved in closed form, just as it appears in part (a) of this step. 
The calculation of the integrals hexp, texp, gexp in preparation for solving 
(QP) is dependent on the particular representation of the random variables 
$(h,t)$ to be discussed in Section 15.6.2. 

The saddlepoint problem of the finite generation method, in the form of the 
dual quadratic program (QP) (see above, Section 15.4), is solved by MINOS. The 
proper files and subroutines for utilizing the MINOS software are automatically 
generated by the computer codes; and the relevant solutions $\lambda_i^k$ and the dual 
multipliers $\bar x_j^\nu$ (for the $w_j$ equations) are passed from MINOS directly back to 
the algorithm. The pair $(\bar x^\nu, \bar z^\nu)$ is the saddlepoint for the first optimization 
problem (1) in the finite generation method. 

Step 3. There are two sequences being generated, $\{\bar x^\nu\}$ in the “inner loop” 
and $\{\bar x^\mu\}$ in the “outer loop”, and we must specify stopping criteria for both. 
Of course in each case we specify a certain number of maximum iterations, 
typically 10 is the maximum for both, and once this maximum is reached we 
stop and make do with what we have. In the case of the inner loop we pass the 
last obtained (suboptimal) $\bar x^\nu$ on as the next proximal point $\bar x^{\mu+1}$ and hope for 
a better result on the next outer loop iteration; in the case of the outer loop we 
stop with a warning that the final $\bar x$ is not optimal. 

In each case we test the relative norm of the difference of the successive 
iterates: 

$$\|\bar x^\mu - \bar x^{\mu-1}\| / \|\bar x^\mu\| < \text{chieps}$$
$$\|\bar x^\nu - \bar x^{\nu-1}\| / \|\bar x^\nu\| < \text{chieps}$$


The threshold “chieps” is a parameter chosen by the user. We have found that 
chieps = $10^{-5}$ gives good results. 
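
The relative-norm test is simple to state in code. The sketch below assumes the Euclidean norm (the text does not say which norm is used) and a hypothetical convention for a zero iterate; the function names are illustrative only. 

```python
import math


def relative_change(current, previous):
    """Return ||current - previous|| / ||current|| in the Euclidean norm."""
    diff = math.sqrt(sum((c - p) ** 2 for c, p in zip(current, previous)))
    size = math.sqrt(sum(c * c for c in current))
    if size == 0.0:
        # Assumption: a zero iterate counts as converged only if nothing changed.
        return 0.0 if diff == 0.0 else math.inf
    return diff / size


def has_converged(current, previous, chieps=1e-5):
    """Relative-norm stopping test with user-chosen threshold chieps."""
    return relative_change(current, previous) < chieps


if __name__ == "__main__":
    print(has_converged([1.0, 2.0], [1.0, 2.0 + 1e-7]))
    print(has_converged([1.0, 2.0], [1.0, 2.1]))
```

The same test can be applied to the inner-loop and outer-loop sequences, each with its own pair of successive iterates. 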

Finally there are two criteria based on the duality gaps for the respective 
problems. In the case of the inner loop it is possible to derive a criterion which 
ensures that the estimate $\bar x^{\mu+1} = \bar x^\nu$, while not precisely the saddlepoint of 
$L^\mu$, represents a good step in the sequence $\{\bar x^\mu\}$, i.e. it gives a linear rate of 
convergence. From [2] we know that to obtain the linear rate it is sufficient to 
choose $\bar x^{\mu+1}$ so that 

$$\|\bar x^{\mu+1} - M_\mu(\bar x^\mu)\|^2 \le \delta_\mu^2 \|\bar x^{\mu+1} - \bar x^\mu\|^2$$

where $M_\mu(\bar x^\mu)$ is the primal half of the true saddlepoint for $L^\mu$, and $\{\delta_\mu\}$ is a 
nonnegative sequence satisfying $\sum_\mu \delta_\mu < \infty$. From [1, theorem 3] it is easy to 
derive an inequality forcing this criterion which in our case turns out to be 


2 
12s get _ gy 
A ey 


< lege G, Daa)4 7(@*")?} 


tl 


é n 
+ SOE e0 q) 19336," rae _ 
j=1 


f=1 
These quantities are all available to the algorithm once a candidate 
$(\bar x^{\mu+1}, \bar z^{\mu+1}) = (\bar x^\nu, \bar z^\nu)$ has been selected from step 2. This criterion turns 
out to be quite a convenient one, in practice often being satisfied before $\{\bar x^\nu\}$ 
satisfies the relative norm test. 

For both sequences $\{\bar x^\nu\}$ and $\{\bar x^\mu\}$ we formulate in a straightforward manner 
a stopping criterion based on the duality gap between the primals and their 
respective duals. We calculate 

dual gap (1) = (value of SQP at $\bar x^{\mu+1}$) - (value of DQP at $\bar z^{\mu+1}$) 
dual gap (2) = (value of SQP$_\mu$ at $\bar x^\nu$) - (value of DQP$_\mu$ at $\bar z^\nu$). 


Then if dual gap (2) is small enough, typically we specify a tolerance of 10~, 
we say that $\{\bar x^\nu\}$ has converged and set $\bar x^{\mu+1} = \bar x^\nu$. If dual gap (1) is small 
enough (usually we set the tolerance at 107°) then we say that $\bar x^{\mu+1}$ “solves” 
SQP and stop. 


15.6 Further Development 


To this date the program has been tested on several problems, and performs 
quite satisfactorily on even a fairly complex problem such as the Lake Balaton 
eutrophication model where literally hundreds of variations have been successfully 
solved. The Balaton problem was modelled in the form DQP by Somlyódy 
and Wets [8], with $c_j = d_j = 0$, consisting of 35 deterministic constraints, 
56 decision variables, 4 stochastic constraints developed from 15 independent 
random variables with a mixture of normal, log-normal and three-parameter- 
gamma distributions. This problem is now solved routinely by our codes. Sim- 
ilar experiences with other (smaller) problems verify that the method is quite 
reliable. 

Typical formulations of the Balaton problem will require the solution of 
between 5 and 20 of the quadratic programs (QP). The amount of work performed 
depends on several factors; among these the principal ones appear to 
be the setting of the quadratic parameters $s_\mu$, $d_j$, $e_i$. 

The level and type of refinement of the discrete approximations to the 
measures is also an important feature. The current development program for 
the algorithm is centered on testing various approaches to improve the algorithm 
by modifying the quadratic parameters, as well as on various discretization 
schemes for improving the approximations to the probability measures. 
15.6.1 Full implementation of proximal point algorithm 


The theory of the finite generation technique does not require that the problem 
SQP be strongly quadratic in order to obtain convergence of the sequence $\{\bar x^\nu\}$, 
but if strong quadraticity is present in both primal and dual variables then we 
obtain linear convergence [1]. In problems where $d_j = 0$, the proximal term 
provides the quadratic behavior in the primal variables. In exactly the same 
way we can introduce a dual proximal term to give quadratic behavior in the 
dual variables when $e_i = 0$; this is achieved by setting 

$$e_i^\mu = e_i + s_\mu^{-1} \quad \text{to replace } e_i,$$
$$h_i^\mu = h_i - \bar z_i^\mu / s_\mu \quad \text{to replace } h_i,$$

where we recall that $(\bar x^\mu, \bar z^\mu)$ is the approximate saddlepoint of $L^{\mu-1}$. This has 
the effect of adding the term $\tfrac{1}{2} s_\mu^{-1} E\{\|z - \bar z^\mu\|^2\}$ to the objective of the dual of 
SQP$_\mu$. Thus we can have a primal, a dual, or a primal-dual implementation of 
the proximal point method. The same theory holds in all cases [2]. However it 
is conceivable that one would like to omit the proximal point algorithm in one 
or both sets of variables. The next stage of the algorithm will include facilities 
for making these kinds of choices. 

Thus it will be possible to solve even the completely linear problem, SLP, 
either directly (without proximal terms) or sequentially (with proximal terms). 
Ideally one should introduce the proximal terms only in cases where the finite 
generation method converges poorly, or is unstable. And of course one would 
like to predict the consequences of introducing these terms; for example, one 
would like to know the optimal setting of $s_\mu$ for a given problem SQP. The 
basic result is that there is a tradeoff: the higher $s$, the faster $\{\bar x^\mu\}$ converges, 
the lower $s$, the better $\{\bar x^\nu\}$ converges [1, theorem 5]. This effect is mediated 
through the quadratic form in SQP. The influence of these forms, and hence 
also the setting of $s_\mu$, is quite dramatic as can be seen from the following runs 
of the (modified) Lake Balaton problem where the $e$ varied between 0.5 and 
50. 
Example 2. Lake Balaton Problem 


              e = 50                  e = 5.0                 e = 0.5 
Outer loop    inner loop   duality    inner loop   duality    inner loop   duality 
step          steps        gap        steps        gap        steps        gap 
p= 2 0.8 x 107! 2 0.6 x 10°? 3 0.3 x 10-? 
p= 1 0.3 x 107} 4 0.3 x 107? 3 0.3 x 1078 
p= 1 0.7 x 10-? 2 1.0 x 107% (converged) 
p=4 1 1.3 x 107? 2 0.6 x 10-8 
pad 1 0.8 x 10-? 2 0.3 x 10-3 
p=6 1 0.6 x 107? (converged) 
p= 1 0.5 x 107? 
p= 8 2 0.3 x 107? 
p=9 1 0.3 x 107? 
p= 10 7 0.3 x 107? 
p=1l1 2 0.3 x 107? 


(does not converge) 


15.6.2 Discretization schemes for the probability measures 

The augmented Lagrangian techniques coupled with the finite generation 
method constitute an effective algorithm for solving the problem SQP if the 
probability measure is discrete. As originally constituted the implementation 
of the algorithm at IIASA used Monte Carlo techniques to generate a sample 
of the random variables $(h,t)$ and employed the sample as a discrete “empirical 
measure”. In this way by increasing the size of the sample we generated 
a sequence of discrete measures $P_{N_s}$ which converge in distribution to the true 
measure $P$. Then by epi-convergence arguments, see [7], the solutions $\bar x_{N_s}$ to 
the corresponding problems (SQP$_{N_s}$) converge to an optimal solution of SQP. 

The implementation of this “simulation” scheme is quite straightforward. 
One simply stores, for each sample point $\omega \in \{1, \ldots, N_s\}$, values for the Monte 
Carlo simulations of the random variables $h$, $t$, i.e., 

$$h_i \leftrightarrow h_i(\omega), \quad \omega \in \{1, \ldots, N_s\}$$
$$t_{ij} \leftrightarrow t_{ij}(\omega), \quad \omega \in \{1, \ldots, N_s\}.$$

The finite elements $\xi_i^k$, $k = 1, \ldots, N_\nu$ are also represented in this way, viz 

$$\xi_i^k \leftrightarrow \xi_i^k(\omega), \quad \omega \in \{1, \ldots, N_s\}$$

We calculate the integrals as follows: 

$$E\{h_i \xi_i^k\} = \frac{1}{N_s} \sum_{\omega=1}^{N_s} h_i(\omega) \xi_i^k(\omega), \text{ etc.}$$
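
With the sample stored point by point, each integral is a plain sample average. The sketch below shows this for a single stochastic constraint; the data layout (lists indexed by sample point) and the function name are assumptions made for illustration and do not describe the Fortran data structures of the IIASA code. 

```python
def sample_integrals(h_samples, t_samples, xi_samples):
    """Sample-average estimates of E{h * xi} and E{t_j * xi} for one constraint.

    h_samples[w]    : simulated value of h at sample point w
    t_samples[w][j] : simulated value of t_j at sample point w
    xi_samples[w]   : value of the finite element xi at sample point w
    """
    n_s = len(h_samples)
    n = len(t_samples[0])
    hexp = sum(h * xi for h, xi in zip(h_samples, xi_samples)) / n_s
    texp = [
        sum(t_samples[w][j] * xi_samples[w] for w in range(n_s)) / n_s
        for j in range(n)
    ]
    return hexp, texp


if __name__ == "__main__":
    h_samples = [1.0, 3.0]
    t_samples = [[1.0, 2.0], [3.0, 4.0]]
    xi_samples = [0.5, 1.0]
    print(sample_integrals(h_samples, t_samples, xi_samples))
```

Storage grows linearly with the sample size, since every random quantity and every finite element is kept at every sample point. 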
Originally we believed that, beginning with a small sample, say $N_1$, and 
solving SQP$_{N_1}$, the solution would be a good starting point for a larger 
sample $N_2 > N_1$. Thus we envisioned a sequence of problems {SQP$_{N_s}$}, each 
one using the last solution as a starting point. But in fact no advantage seems 
to be gained over using the initial point given by the deterministic problem 
with random quantities replaced by expectations. 

A more promising approach to discretizing the probability measure is to 
take advantage of the convexity present in the problem and devise a discretization 
scheme based on conditional expectations, obtaining a discrete probability 
measure. By solving the resulting approximate problem we obtain a solution $\bar x$ 
giving an optimal value which is a valid upper bound for the true value of (SQP) 
at $\bar x$ and hence also an upper bound for the true maximum of SQP [7, section 
4]. 

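The implementation of this “conditional expectations” representation is 
slightly more involved. Here we present the case where $q$ is fixed (deterministic), 
and $t_{ij}$, $h_i$ are all independent. For each random variable we have constructed 
a partition of its support (a subset of the real line) and then we have calculated 
the conditional expectations tcexp$_{ij}(k)$ and hcexp$_i(k)$, $k = 1, \ldots,$ npart. We 
represent $t_{ij}$, $h_i$ by the collection of discrete random variables which take values 
tcexp$_{ij}(k)$, hcexp$_i(k)$ with probabilities tprob$_{ij}(k)$ and hprob$_i(k)$. Now let 

$$\Gamma = \{\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_n) \mid \gamma_j \in \{1, \ldots, \text{npart}\}\}$$

i.e. $\Gamma$ is the set of all $(n+1)$-permutations (with repetition) of npart letters. 
We represent the finite elements $\xi^k$, $k = 1, \ldots, N_\nu$ in the following way 

$$\xi_i^k \leftrightarrow \xi_i^k(\gamma), \quad \gamma \in \Gamma.$$

Thus, for example, 

$$\xi_i^k = \theta'\Big(q_i^-, q_i^+; e_i^{-1} \Big(\sum_{j=1}^{n} t_{ij} \bar x_j^k - h_i\Big)\Big)$$

is calculated as 

$$\xi_i^k(\gamma) = \theta'\Big(q_i^-, q_i^+; e_i^{-1} \Big(\sum_{j=1}^{n} \text{tcexp}_{ij}(\gamma_j) \bar x_j^k - \text{hcexp}_i(\gamma_0)\Big)\Big),$$

and the integrals are calculated as 

$$E\{h_i \xi_i^k\} = \sum_{\gamma \in \Gamma} \text{hcexp}_i(\gamma_0)\, \xi_i^k(\gamma)\, \text{hprob}_i(\gamma_0) \prod_{j'=1}^{n} \text{tprob}_{ij'}(\gamma_{j'})$$

$$E\{t_{ij} \xi_i^k\} = \sum_{\gamma \in \Gamma} \text{tcexp}_{ij}(\gamma_j)\, \xi_i^k(\gamma)\, \text{hprob}_i(\gamma_0) \prod_{j'=1}^{n} \text{tprob}_{ij'}(\gamma_{j'})$$

$$E\{e_i \xi_i^k \xi_i^{k'}\} = \sum_{\gamma \in \Gamma} e_i\, \xi_i^k(\gamma)\, \xi_i^{k'}(\gamma)\, \text{hprob}_i(\gamma_0) \prod_{j'=1}^{n} \text{tprob}_{ij'}(\gamma_{j'})$$

(We need only use $(n+1)$-permutations as opposed to $\ell(n+1)$-permutations 
because the problem is separable in the $i = 1, \ldots, \ell$ stochastic second stage 
constraints.) 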
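
The enumeration over $\Gamma$ can be sketched as follows for one stochastic constraint. The example assumes, purely for illustration, that every random variable uses the same number of cells and that an equal-width partition of a uniform distribution is wanted; the helper names (`partition_uniform`, `integrals_for_constraint`) are hypothetical. Indices run from 0 rather than 1, as is usual in Python. 

```python
from itertools import product


def partition_uniform(low, high, npart):
    """Equal-width partition of a uniform distribution on [low, high].

    Returns the conditional expectation and probability of each cell.
    """
    width = (high - low) / npart
    cexp = [low + (k + 0.5) * width for k in range(npart)]
    prob = [1.0 / npart] * npart
    return cexp, prob


def theta_prime(q_minus, q_plus, tau):
    """Derivative of the penalty: tau clipped to [-q_minus, q_plus]."""
    return max(-q_minus, min(q_plus, tau))


def integrals_for_constraint(x, e, q_minus, q_plus, hcexp, hprob, tcexp, tprob):
    """E{h * xi} and E{t_j * xi} under the discretized measure, one constraint.

    hcexp, hprob       : per-cell values and probabilities for h
    tcexp[j], tprob[j] : per-cell values and probabilities for t_j
    xi is the closed-form element generated from the point x.
    """
    n = len(x)
    npart = len(hcexp)
    e_h_xi = 0.0
    e_t_xi = [0.0] * n
    # gamma[0] indexes the cell of h, gamma[j + 1] the cell of t_j.
    for gamma in product(range(npart), repeat=n + 1):
        h_val = hcexp[gamma[0]]
        t_val = [tcexp[j][gamma[j + 1]] for j in range(n)]
        weight = hprob[gamma[0]]
        for j in range(n):
            weight *= tprob[j][gamma[j + 1]]
        v = sum(t_val[j] * x[j] for j in range(n)) - h_val
        xi = theta_prime(q_minus, q_plus, v / e)
        e_h_xi += h_val * xi * weight
        for j in range(n):
            e_t_xi[j] += t_val[j] * xi * weight
    return e_h_xi, e_t_xi


if __name__ == "__main__":
    hcexp, hprob = partition_uniform(0.0, 2.0, 2)
    tcexp, tprob = partition_uniform(1.0, 3.0, 2)
    print(integrals_for_constraint([1.0], 1.0, 10.0, 10.0,
                                   hcexp, hprob, [tcexp], [tprob]))
```

Only the per-cell tables are stored; the loop visits all npart$^{n+1}$ index tuples, so the work, though not the memory, still grows with $|\Gamma|$. 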

There are some minor advantages to this implementation. Note that we 
do not have to keep in memory values of tcexp$_{ij}$, hcexp$_i$ for each $\gamma \in \Gamma$ but only 
for each $k = 1, \ldots,$ npart. Since $|\Gamma|$ can be quite large there is a considerable 
saving of memory allocation and access time over the Monte Carlo simulation 
implementation, where typically we would take $N_s \approx |\Gamma|$ (in a small problem). 
Furthermore, the resulting problem for the conditional expectations scheme can 
be stated in the standard input format [11]. 

The major advantage is, of course, that we have a valid upper bound. It is 
possible to combine this discretization with a lower bounding approach which 
utilizes the fact that the function 

$$(h, t) \mapsto -e_i \theta\Big(q_i^-, q_i^+; e_i^{-1} \Big(\sum_{j=1}^{n} t_{ij} x_j - h_i\Big)\Big)$$

is concave for fixed $x$, and then develops a measure on the extreme points of the 
partitions of the supports of the random variables assuming they are compact. 
One then develops a sequence of partitions, narrowing the gap between upper 
and lower bounds until a suitable tolerance is attained. For the case where $t$ 
is fixed (deterministic), there is an optimal partitioning scheme [7]. However 
in the case where $t$ is stochastic it is not yet clear what is the correct way to 
proceed. 
CHAPTER 16 


STOCHASTIC QUASIGRADIENT METHODS 
AND THEIR IMPLEMENTATION 


A. Gaivoronski 


16.1 Introduction 


This paper discusses various stochastic quasigradient methods (see [1], [2]) and 
considers their computer implementation. It is based on experience gained both 
at the V. Glushkov Institute of Cybernetics in Kiev and at IIASA. 

We are concerned here mainly with questions of implementation, such as 
the best way to choose step directions and step sizes, and therefore little atten- 
tion will be paid to theoretical aspects such as convergence theorems and their 
proofs. Readers interested in the theoretical side are referred to [1],[3]. 

The paper is divided into five sections. After introducing the main problem 
in Section 16.1, we discuss the various ways of choosing the step size and step 
direction in Sections 16.2 and 16.3. A detailed description of an interactive 
stochastic optimization package (STO) currently available at IIASA is given 
in Section 16.4. This package represents one possible implementation of the 
methods described in the previous sections. Finally, Section 16.5 deals with 
the solution of some test problems using this package. These problems were
brought to our attention by other IIASA projects and collaborating institutions
and include a facility location problem, a water resources management problem,
and the problem of choosing the parameters in a closed loop control law for a
stochastic dynamical system with delay.

We are mainly concerned with the problem 


\[ \min \{ F(x) : x \in X \}, \qquad F(x) = E_\omega f(x,\omega), \tag{16.1} \]


where $x$ represents the variables to be chosen optimally, $X$ is a set of constraints,
and $\omega$ is a random variable belonging to some probability space $(\Omega, B, P)$.
Here $B$ is a Borel field and $P$ is a probability measure.

There are currently two main approaches to this problem. In the first, we 
take the mathematical expectation in (16.1), which leads to multidimensional 
integration and involves the use of various approximation schemes [3-6]. This 
reduces problem (16.1) to a special kind of nonlinear programming problem 
which allows the application of deterministic optimization techniques. In this 
paper we concentrate on the second approach, in which we consider a very
limited number of observations of the random function $f(x,\omega)$ at each iteration
in order to determine the direction of the next step. The resulting errors are
smoothed out until the optimization process terminates (which happens when
the step size becomes sufficiently small). This approach was pioneered in [7],[8].

We assume that the set $X$ is defined in such a way that the projection
operation $x \to \pi_X(x)$ is comparatively inexpensive from a computational point of
view, where $\pi_X(x) = \arg\min_{z \in X} \|x - z\|$. For instance, if $X$ is defined by linear
constraints, then projection is reduced to a quadratic programming problem
which, although challenging if large scale, can nevertheless be solved in a
finite number of iterations. In this case it is possible to implement a stochastic
quasigradient algorithm of the following type:


\[ x^{s+1} = \pi_X (x^s - \rho_s v^s). \tag{16.2} \]


Here $x^s$ is the current approximation of the optimal solution, $\rho_s$ is the step
size, and $v^s$ is a random step direction. This step direction may, for instance,
be a statistical estimate of the gradient (or subgradient in the nondifferentiable
case) of function $F(x)$: then $v^s = \xi^s$ such that


\[ E(\xi^s \mid x^1, x^2, \ldots, x^s) = F_x(x^s) + a^s, \tag{16.3} \]


where $a^s$ decreases as the number of iterations increases, and the vector $v^s$ is
called a stochastic quasigradient of function $F(x)$. Usually $\rho_s \to 0$ as $s \to \infty$ and
therefore $\|x^{s+1} - x^s\| \to 0$ from (16.2). This suggests that we should take $x^s$ as
the initial point for the solution of the projection problem at iteration number
$s+1$, thus reducing considerably the computational effort needed to solve the
quadratic programming problem at each step $s = 1,2,\ldots$. Algorithm (16.2)-
(16.3) can also cope with problems with more general constraints formulated
in terms of mathematical expectations


\[ E_\omega f^i(x,\omega) \ge 0, \qquad i = 1,\ldots,m, \]


by making use of penalty functions or the Lagrangian (for details see [1],[3]). 

The principal peculiarity of such methods is their nonmonotonicity, which 
may sometimes show itself in highly oscillatory behavior. In this case it is 
difficult to judge whether the algorithm has already approached a neighborhood 
of the optimal point or not, since exact values of the objective function are not 
available. The best way of dealing with such difficulties seems to be to use an 
interactive procedure to choose the step sizes and step directions, especially if 
it does not take much time to make one observation. More reasons for adopting 
an interactive approach and details of the implementation are given in the 
following sections. 

Another characteristic of the algorithms described here is their pattern of 
convergence. Because of the probabilistic nature of the problem, their asymp- 
totic rate of convergence is extremely slow and may be represented by 


\[ \|x^* - x^s\| \sim \frac{C}{\sqrt{k}}. \tag{16.4} \]

Here $x^*$ is the optimal point to which the sequence $x^s$ converges and $k$ is the number
of observations of random parameters $\omega$, which in many cases is proportional to
the number of iterations. In deterministic optimization a superlinear asymptotic
convergence rate is generally expected; a rate such as (16.4) would be considered
as nonconvergence. But no algorithm can do asymptotically any better than
this for stochastic problem (16.1) in the presence of nondegenerate random
disturbances, and therefore the aim is to reach some neighborhood of the solution
rather than to find the precise value of the solution itself. Algorithm (16.2)-
(16.3) is quite good enough for this purpose.
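
Iteration (16.2) can be illustrated with a minimal sketch. Everything problem-specific here is an assumption made for illustration and is not taken from the STO package: the feasible set is a box (so that the projection is a componentwise clip), the random function is $f(x,\omega) = \frac{1}{2}\|x - \omega\|^2$ with normally distributed $\omega$, and the step size is $\rho_s = c/s$.

```python
import random


def project_box(x, lo, hi):
    """Projection onto the box X = [lo, hi]^n (an assumed, cheap-to-project set)."""
    return [min(max(xi, lo), hi) for xi in x]


def sqg(grad_sample, x0, steps, c=1.0, lo=-10.0, hi=10.0, seed=0):
    """Iterate x^{s+1} = pi_X(x^s - rho_s * xi^s) with rho_s = c / s, as in (16.2)."""
    rng = random.Random(seed)
    x = list(x0)
    for s in range(1, steps + 1):
        xi = grad_sample(x, rng)   # stochastic quasigradient xi^s
        rho = c / s                # programmed step size
        x = project_box([xj - rho * gj for xj, gj in zip(x, xi)], lo, hi)
    return x


MEAN = [1.0, -2.0]  # hypothetical mean of omega; the minimizer of F is MEAN


def noisy_gradient(x, rng):
    """f(x, w) = 0.5 * ||x - w||^2 with w ~ N(MEAN, I), so f_x(x, w) = x - w."""
    return [xj - rng.gauss(mj, 1.0) for xj, mj in zip(x, MEAN)]


if __name__ == "__main__":
    print(sqg(noisy_gradient, [5.0, 5.0], steps=20000))
```

With these choices the iterate is simply the running mean of the observations, so the error decays like $C/\sqrt{k}$, in line with (16.4).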


16.2 Choice of Step Direction 


In this section we shall discuss different ways of choosing the step direction 
in algorithm (16.2) and some closely related algorithms. We shall first discuss 
methods which are based on observations made at the current point $x^s$ or in
its immediate vicinity. More general ways are then presented which take into
account observations made at previous points. 


16.2.1 Gradients of random function $f(x,\omega)$


The simplest case arises when it is possible to obtain gradients (or subgradients
in the nondifferentiable case) of function $f(x,\omega)$ at fixed values of $x$ and $\omega$. In
this case we can simply take


\[ \xi^s = f_x(x^s, \omega^s), \tag{16.5} \]


where $\omega^s$ is an observation of random parameter $\omega$ made at step number $s$.
If both the observation of random parameters and the evaluation of gradients
are computationally inexpensive then it is possible to take the average of some
specified number $N$ of gradient observations:


\[ \xi^s = \frac{1}{N} \sum_{i=1}^{N} f_x(x^s, \omega^{i,s}). \tag{16.6} \]


These observations can be selected in two ways. The first is to choose the
$\omega^{i,s}$ according to their probability distribution. If we do not know the form of
the distribution function (as, for example, in Monte-Carlo simulation models)
this is the only option. However, in this case the influence of low-probability
high-cost events may not be properly taken into account. In addition, the
asymptotic error of the gradient estimate $\xi^s$ is approximately proportional to
$1/\sqrt{N}$. The second approach may be used when we know the distribution of
the random parameters $\omega$. In this case many other estimates can be derived;
the use of pseudo-random numbers* in particular may lead to an asymptotic
error approximately proportional to $\log(N)/N$, which is considerably less than
in the purely random case. However, more theoretical research and more
computational experience are necessary before we can assess the true value of this
approach. The main question here is whether the increase in the speed of
convergence is sufficient to compensate for the additional computational effort
required for more exact estimations of the $F_x(x^s)$.

* A concept which arose from the use of quasi-Monte-Carlo techniques in
multidimensional integration [9].
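
The $1/\sqrt{N}$ behavior of the averaged estimate (16.6) under purely random sampling can be checked numerically. The sketch below is only an illustration under assumed data: a one-dimensional $f(x,\omega) = \frac{1}{2}(x - \omega)^2$ with standard normal $\omega$, so that $f_x(x,\omega) = x - \omega$ has unit variance.

```python
import random


def averaged_gradient(x, n_samples, rng):
    """Estimate (16.6): the mean of N gradient observations f_x(x, w) = x - w."""
    return sum(x - rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples


def empirical_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def variance_of_estimate(n_samples, replications=2000, seed=0):
    """Variance of the averaged gradient over independent replications at x = 2."""
    rng = random.Random(seed)
    estimates = [averaged_gradient(2.0, n_samples, rng) for _ in range(replications)]
    return empirical_variance(estimates)


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, variance_of_estimate(n))
```

The variance falls roughly like $1/N$, so the error falls like $1/\sqrt{N}$, while the cost per iteration grows like $N$; this is the trade-off raised in the paragraph above.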

Unfortunately, our theoretical knowledge concerning the asymptotic
behavior of processes of type (16.2) tells us little about the optimal number of
samples, even for relatively well-studied cases. For instance, what would be
the optimal number $N$ of observations for the case in which function $F(x)$ is
differentiable and there are no constraints? In this case we can establish both
asymptotic normality and the value of the asymptotic variance. If, additionally,
$\rho_s \sim C/s$ then the total number of observations required to obtain a given
asymptotic variance is the same for all $N \ll s$. If $s\rho_s \to \infty$ then the wait-and-see
approach is asymptotically superior as long as $N \ll s$.

However, there is strong evidence that in constrained and/or nondifferentiable
cases the value of $N$ should be chosen adaptively. A very simple example
provides some insight into the problem. Suppose that $x \in R^1$, $X = [a,\infty)$,
$F(x) = x$, $f_x(x^s,\omega^s) = 1 + \omega^s$, where the $\omega^s$, $s = 1,2,\ldots$, are independent
random variables with zero mean. The obvious solution of this problem is
$x = a$. Suppose for simplicity that $\rho_s = \rho$. This will not alter our argument
greatly because $\rho_s$ usually changes very slowly for large $s$. In this case method
(16.2),(16.5) will be of the form:


\[ x^{s+1} = x^s - \rho(1 + \omega^s) + \tau_s, \]


\[ \tau_s = \max\{0,\; a - x^s + \rho(1 + \omega^s)\}. \]


Method (16.2),(16.6) requires us to choose a step size $N$ times greater than
$\rho$; otherwise its performance would be inferior to that of method (16.2),(16.5)
(unless the initial point is in the immediate vicinity of the minimum). Method
(16.2),(16.6) then becomes


\[ x^{s+1} = x^s - N\rho \Bigl(1 + \frac{1}{N} \sum_{i=1}^{N} \omega^{i,s}\Bigr) + \theta_s, \]


\[ \theta_s = \max\Bigl\{0,\; a - x^s + N\rho \Bigl(1 + \frac{1}{N} \sum_{i=1}^{N} \omega^{i,s}\Bigr)\Bigr\}. \]


In order to compare the two methods we shall let $s$ in the last equation denote
the number of observations rather than the number of iterations and renumber
the observations $\omega^{i,s}$. The process


\[ y^k = y^0 - \rho \sum_{i=0}^{k-1} (1 + \omega^i) + \sum_{i=0}^{k-1} \chi_i, \tag{16.7} \]

\[ \chi_i = \begin{cases} 0 & \text{if } i \ne lN \text{ for } l = 1,2,\ldots \text{ or } a < y^i - \rho(1 + \omega^i), \\ a - y^i + \rho(1 + \omega^i) & \text{otherwise,} \end{cases} \]


has the property that $y^{sN} = x^s$ and therefore it is sufficient to compare $y^k$ with
$z^k$ for $k = lN$, where


\[ z^k = z^0 - \rho \sum_{i=0}^{k-1} (1 + \omega^i) + \sum_{i=0}^{k-1} \tau_i. \tag{16.8} \]


Suppose that $z^0 = y^0 \ne a$. Then if $t_z = \min\{k : z^k = a\}$ represents the time at
which process $z^k$ first encounters the optimal point and $t_y = \min\{k : y^k = a\}$
represents the time of the corresponding encounter of process $y^k$ with the
optimal point, it is clear that $t_z \le t_y$ because from (16.7) and (16.8) we have that
$y^k = z^k$ for $k < t_z$. This means that algorithm (16.2),(16.5) will get from some
remote initial point to the vicinity of the optimal point faster than algorithm
(16.2),(16.6) with $N > 1$. Now let us take $z^0 = y^0 = a$. Then (16.7) and (16.8)
imply that $\chi_i = 0$ for $i < N$ while $\tau_i$ may differ from zero. Therefore in this
case $z^N \ge y^N = x^1$ and the performance of algorithm (16.2),(16.6) with $N > 1$
becomes superior to that of algorithm (16.2),(16.5) after reaching the vicinity of
the optimal point. This simple example demonstrates several important
properties of constrained stochastic optimization problems, although more work is
necessary before we can make any firm theoretical recommendations concerning
the choice of the number of samples $N$. Above all, an appropriate definition of
the rate of convergence is needed: recent results by Kushner [10] may be useful
in this regard.
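
The comparison of the two processes can be reproduced by driving both with the same stream of observations. The following sketch assumes normally distributed $\omega$ and projects the batched process after every $N$-th observation; this indexing convention, the function name `compare` and the parameter values are assumptions of the illustration.

```python
import random


def compare(a, x0, rho, n_batch, n_obs, sigma=1.0, seed=0):
    """Run both processes on X = [a, inf), F(x) = x, with shared observations.

    z: projected after every observation, as in (16.2),(16.5).
    y: accumulates n_batch observations, then projected once, as in (16.2),(16.6)
       with step size n_batch * rho.
    Returns the final z, final y and the first hitting times of the point a.
    """
    rng = random.Random(seed)
    z = y = x0
    hit_z = hit_y = None
    for k in range(1, n_obs + 1):
        step = rho * (1.0 + rng.gauss(0.0, sigma))
        z = max(a, z - step)        # projection at every observation
        y = y - step                # no projection inside a batch
        if k % n_batch == 0:
            y = max(a, y)           # projection once per batch
        if hit_z is None and z == a:
            hit_z = k
        if hit_y is None and y == a:
            hit_y = k
    return z, y, hit_z, hit_y


if __name__ == "__main__":
    print(compare(a=0.0, x0=5.0, rho=0.1, n_batch=10, n_obs=5000))
```

On any sample path the single-observation process reaches $a$ no later than the batched one, while at the end of each batch the batched process is at least as close to $a$; both effects are the ones described in the example above.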

A rather general adaptive way of changing the number $N$ would be to begin
with a small value of $N$ for the first few iterations ($N = 1$, for example), and
increase $N$ if additional tests show that the current point is in the vicinity of
the optimum. The following averaging procedure has been shown to be useful
in tests of this type:


\[ v^{s+1} = (1 - a_s) v^s + a_s \xi^s, \qquad 0 \le a_s \le 1, \tag{16.9} \]


where $\xi^s$ is defined by (16.5) or (16.6). It can be shown (see [1], [2]) that
$\|v^s - F_x(x^s)\| \to 0$ under rather general conditions, which include $\rho_s/a_s \to 0$.
The decision as to whether to change $N$ may then be based on the value of
$\tau_s = \|x^s - \pi_X(x^s - v^s)\|$. One possibility is to estimate $\xi^s$ and its empirical
variance at the same time:


\[ \sigma_N^2 = \frac{1}{N} \sum_{i=1}^{N} \| f_x(x^s, \omega^{i,s}) - \xi^s \|^2 \]


and choose $N$ such that $\sigma_N^2 \le \beta \tau_s$, where the value of $\beta$ is set before beginning
the iterations. In practice it is sufficient to consider a constant $a_s = a \approx 0.01$ to
$0.05$, where the greater the randomness, the smaller the value of $a$. Our
empirical recommendation for the initial value of $N$ is $\sigma_N^2 \approx 0.1 \max_{x_1,x_2 \in X} \|x_1 - x_2\|$.
This method can be used to increase the number of samples per iteration
automatically. Another possibility is to alter the value of $N$ interactively; this
is one of the options implemented in the interactive package STO, which has
recently been developed at IIASA. Numerical experiments conducted with this
package show that in problems where $f_x(x,\omega)$ has a high variance, choosing
a value of $N$ greater than one can bring about considerable improvements in
performance.

The method described above uses increasingly precise estimates of the
gradient, and therefore shares some of the features of the approximation techniques
developed in [3-6] for solving stochastic programming problems. All of the
remarks made here concerning sampling are also valid for the other methods of
choosing $\xi^s$ described below.

However, it is not always possible to use observations of the gradient
$f_x(x,\omega)$ of the random function to compute a stochastic quasigradient. In
many cases the analytic expression of $f_x(x,\omega)$ is not known, and even if it is, it
may be difficult to create a subroutine to evaluate it, especially for large-scale
problems. In this case it is necessary to use a method which relies only on
observations of $f(x,\omega)$.


16.2.2 Finite-difference approximations 


If function $F(x)$ is differentiable, one possibility is to use forward finite
differences:


\[ \xi^s = \sum_{i=1}^{n} \frac{f(x^s + \delta_s e_i, \omega^s_{i1}) - f(x^s, \omega^s_{i2})}{\delta_s}\, e_i, \tag{16.10} \]

or central finite differences:

\[ \xi^s = \sum_{i=1}^{n} \frac{f(x^s + \delta_s e_i, \omega^s_{i1}) - f(x^s - \delta_s e_i, \omega^s_{i2})}{2\delta_s}\, e_i, \tag{16.11} \]

where the $e_i$ are unit basis vectors from $R^n$. The most important question
here is the value of $\delta_s$. In order to ensure convergence with probability one it is
sufficient to take any sequence $\delta_s$ such that $\sum_{s=1}^{\infty} \rho_s^2 \delta_s^{-2} < \infty$. If it is possible to
take $\omega^s_{i1} = \omega^s_{i2}$ then any $\delta_s \to 0$ will do. However, the method may reach the
vicinity of the optimal point much faster if $\delta_s$ is chosen adaptively. On the first
few iterations $\delta_s$ should be large, decreasing as the current point approaches the
optimal point. The main reason for this is that taking a large step $\delta_s$ when the
current point is far from the solution may smooth out the randomness to some
extent, and may also overcome some of the problems (such as curved valleys)
caused by the erratic behavior of the deterministic function $F(x)$. One possible
way of implementing such a strategy in an unconstrained case is given below.

(i) Take a large initial value of $\delta_s$, such as $\delta_s \approx 0.1 \max_{x_1,x_2 \in X} \|x_1 - x_2\|$.

(ii) Proceed with iterations (16.2), where $\xi^s$ is determined using (16.10) or
(16.11). While doing this, compute an estimate of the gradient $v^s$ from
(16.9).

(iii) Take

\[ \delta_{s+1} = \begin{cases} \delta_s & \text{if } \|v^s\| > \beta_1 \delta_s, \\ \beta_2 \delta_s & \text{otherwise,} \end{cases} \]

where the values of $\beta_1$ and $\beta_2$ should be chosen before beginning the
iterative process.


It can be shown that this process converges when $\omega^s_{i1} = \omega^s_{i2}$, although it will
also produce a good approximation to the solution even if this requirement
is not met. Estimate (16.9) is not the only possibility; in fact, any of the
estimates of algorithm performance given in Section 16.3 would do.
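
Steps (i)-(iii) can be sketched for an unconstrained problem as follows. The test function, the sampling routine, the constants `beta1`, `beta2`, `a`, `c` and the use of the Euclidean norm of $v^s$ in rule (iii) are assumptions made for this illustration; one common observation of $\omega$ is used for all function evaluations within an iteration.

```python
import math
import random


def forward_difference(f, x, delta, w):
    """Forward differences (16.10) with one common sample w for all evaluations."""
    base = f(x, w)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += delta
        grad.append((f(shifted, w) - base) / delta)
    return grad


def sqg_adaptive_delta(f, sample, x0, steps, delta0,
                       beta1=1.0, beta2=0.5, a=0.05, c=1.0, seed=0):
    """Unconstrained iterations (16.2) with the finite-difference step rule (iii)."""
    rng = random.Random(seed)
    x, delta, v = list(x0), delta0, None
    for s in range(1, steps + 1):
        xi = forward_difference(f, x, delta, sample(rng))
        # Averaged gradient estimate (16.9); initialized with the first xi.
        v = xi if v is None else [(1 - a) * vj + a * gj for vj, gj in zip(v, xi)]
        x = [xj - (c / s) * gj for xj, gj in zip(x, xi)]
        # Rule (iii): keep delta while ||v|| > beta1 * delta, otherwise shrink it.
        if math.sqrt(sum(vj * vj for vj in v)) <= beta1 * delta:
            delta *= beta2
    return x, delta


def f(x, w):
    """Hypothetical random function: 0.5 * ||x - w||^2."""
    return 0.5 * sum((xj - wj) ** 2 for xj, wj in zip(x, w))


def sample(rng):
    """Hypothetical observation of omega with mean (1, -2)."""
    return [rng.gauss(1.0, 1.0), rng.gauss(-2.0, 1.0)]


if __name__ == "__main__":
    print(sqg_adaptive_delta(f, sample, [5.0, 5.0], steps=5000, delta0=1.0))
```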

Another strategy is to relate changes in the finite-difference approximation
step to changes in the step size. This is especially advisable if the step size is
also chosen adaptively (see Section 16.3). In the simplest case one may fix $\beta_1 > 0$
before starting and choose $\delta_s = \beta_1 \rho_s$, which, although contrary to theoretical
recommendations, will nevertheless bring the current point reasonably close to
the optimal point. To obtain a more precise solution it is necessary to reduce
$\beta_1$ during the course of the iterations. This may be done either automatically
or interactively; both of these options are currently available in the stochastic
optimization package STO.

Finite-difference algorithms (16.10) and (16.11) have one major
disadvantage, and this is that the stochastic quasigradient variance increases as $\delta_s$
decreases. This means that finite-difference algorithms converge more slowly than
algorithms which use gradients (16.5). There are two ways of overcoming this
problem. Firstly, if it is possible to make observations of function $f(x,\omega)$ for
various values of $x$ and fixed $\omega$, it is a good idea to take the same values of
$\omega$ for the differences (i.e., $\omega^s_{i1} = \omega^s_{i2}$) when $\delta_s$ is small because this reduces
the variance of the estimates quite considerably. Another way of avoiding this
increase in the variance is to increase the number of samples used to obtain $\xi^s$
when approaching the optimal point, i.e., to use finite-difference analogues of
(16.6). If there exists a $\gamma > 0$ such that $N_s \delta_s^2 \ge \gamma$, where $N_s$ is the number of
samples taken at step number $s$, then the variance of $\xi^s$ remains bounded.
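
The effect of using the same observation of $\omega$ in both terms of a finite difference can be seen in a small experiment. The sketch assumes a one-dimensional $f(x,\omega) = \frac{1}{2}(x - \omega)^2$ with standard normal $\omega$; the function names are hypothetical.

```python
import random


def f(x, w):
    """Hypothetical random function 0.5 * (x - w)^2."""
    return 0.5 * (x - w) ** 2


def forward_diff_1d(x, delta, w_shifted, w_base):
    """One-dimensional forward difference with possibly different observations."""
    return (f(x + delta, w_shifted) - f(x, w_base)) / delta


def estimate_variance(common, delta=0.01, replications=2000, seed=0):
    """Sample variance of the finite-difference estimate at x = 1."""
    rng = random.Random(seed)
    values = []
    for _ in range(replications):
        w1 = rng.gauss(0.0, 1.0)
        w2 = w1 if common else rng.gauss(0.0, 1.0)  # common vs independent samples
        values.append(forward_diff_1d(1.0, delta, w1, w2))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


if __name__ == "__main__":
    print("common observations:     ", estimate_variance(True))
    print("independent observations:", estimate_variance(False))
```

With independent observations the variance grows like $1/\delta_s^2$, whereas with common observations it stays close to the variance of the gradient itself.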

It is sometimes useful to normalize the $\xi^s$, especially when the variance is
large.

Another disadvantage of the finite-difference approach is that it requires 
n +1 evaluations of the objective function for forward differences and 2n for 
central differences, where n is the dimension of vector z. This may not be 
acceptable in large-scale problems and in cases where function evaluation is 
computationally expensive. In this situation a stochastic quasigradient can be 
computed using some analogue of random search techniques. 
16.2.3 Analogues of random search methods


When it is not feasible to compute $n+1$ values of the objective function at each
iteration, the following approach (which has some things in common with the
random search techniques developed for deterministic optimization problems)
may be used:

\[ \xi^s = \sum_{i=1}^{M_s} \frac{f(x^s + \delta_s h_i, \omega^s_{i1}) - f(x^s, \omega^s_{i2})}{\delta_s}\, h_i. \tag{16.12} \]


Here the $h_i$ are vectors distributed uniformly on the unit sphere, $M_s$ is the
number of random points and $\delta_s$ is the step taken in the random search. The
choice of $M_s$ is determined by the computational facilities available, although
it is advisable to increase $M_s$ as $\delta_s$ decreases. This method of choosing $\xi^s$ has
much in common with finite-difference schemes, and the statements made above
about the choice of $\delta_s$ in the finite-difference case also hold for (16.12).
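
A possible realization of (16.12) is sketched below. Directions uniform on the unit sphere are generated by normalizing Gaussian vectors; a single common observation `w` is used for all evaluations, and the sum is left unnormalized so that its scale is absorbed by the step size. These choices, and the function name, are assumptions of the illustration.

```python
import math
import random


def random_search_quasigradient(f, x, delta, m, w, rng):
    """Random-search estimate (16.12) built from m + 1 function evaluations."""
    n = len(x)
    base = f(x, w)
    xi = [0.0] * n
    for _ in range(m):
        # A normalized Gaussian vector is uniformly distributed on the unit sphere.
        h = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(hj * hj for hj in h))
        h = [hj / norm for hj in h]
        shifted = [xj + delta * hj for xj, hj in zip(x, h)]
        slope = (f(shifted, w) - base) / delta   # directional finite difference
        xi = [acc + slope * hj for acc, hj in zip(xi, h)]
    return xi


def f(x, w):
    """Hypothetical random function 0.5 * ||x - w||^2, with gradient x - w."""
    return 0.5 * sum((xj - wj) ** 2 for xj, wj in zip(x, w))


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


if __name__ == "__main__":
    rng = random.Random(0)
    x, w = [3.0, 4.0, 0.0], [0.0, 0.0, 0.0]
    xi = random_search_quasigradient(f, x, 1e-3, 2000, w, rng)
    print(cosine(xi, [xj - wj for xj, wj in zip(x, w)]))
```

On average each term contributes the gradient scaled by $1/n$, so the estimate points along the gradient up to a positive factor of about $M_s/n$.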


16.2.4 Smoothing the objective function 


Methods of choosing $\xi^s$ which rely on finite-difference or random search
techniques are only appropriate when the objective function $F(x)$ is differentiable.
The use of similar procedures in the nondifferentiable case would require some
smoothing of the objective function. Suppose that the function $F(x)$ is not
differentiable but satisfies the Lipschitz condition, and consider the function


\[ F(x,r) = \int F(x + y)\, dH(y,r), \tag{16.13} \]


where $H(y,r)$ is a probability measure with support in a ball of radius $r$
centered at zero. We shall assume for simplicity that $H(y,r)$ has nonzero density
inside this ball. The function $F(x,r)$ is differentiable and $F(x,r) \to F(x)$
uniformly over every compact set as $r \to 0$. It is now possible to minimize the
nonsmooth function $F(x)$ by computing stochastic quasigradients for smooth
functions $F(x,r)$ and find the optimal solution of the initial problem by letting
$r \to 0$. This idea was proposed in [11] and studied further in [12]. It is not
actually necessary to calculate the integral in (16.13); it is sufficient to compute
$\xi^s$ using equations (16.10)-(16.12), but at point $x^s + y^s$ rather than point $x^s$,
where $y^s$ is a random variable distributed according to $H(y,r_s)$. In this case
(16.10) becomes:


\[ \xi^s = \sum_{i=1}^{n} \frac{f(x^s + y^s + \delta_s e_i, \omega^s_{i1}) - f(x^s + y^s, \omega^s_{i2})}{\delta_s}\, e_i. \tag{16.14} \]


The most commonly used distribution $H(y,r)$ is uniform distribution on an
$n$-dimensional cube of side $r$. If we want to have convergence with probability
one we should choose $r_s$ such that $\delta_s/r_s \to 0$ and $(r_s - r_{s+1})/\rho_s \to 0$. In
practical computations it is also advisable to choose the smoothing parameter
$r_s$ in a similar way to $\delta_s$, using one of the adaptive procedures discussed above.
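
The smoothing device in (16.14) can be sketched for a deterministic nonsmooth function, which isolates the effect of the random shift $y^s$ from that of $\omega$. The sketch assumes the uniform distribution on a cube of side $r$ and the test function $F(x) = |x_1|$; the function names are hypothetical.

```python
import random


def smoothed_forward_difference(f, x, delta, r, rng):
    """Forward differences taken at x + y, with y uniform on a cube of side r."""
    y = [rng.uniform(-r / 2.0, r / 2.0) for _ in x]
    point = [xj + yj for xj, yj in zip(x, y)]
    base = f(point)
    grad = []
    for i in range(len(point)):
        shifted = list(point)
        shifted[i] += delta
        grad.append((f(shifted) - base) / delta)
    return grad


def mean_estimate(f, x, delta, r, replications=4000, seed=0):
    """Average of many smoothed estimates; approximates the gradient of F(x, r)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(replications):
        g = smoothed_forward_difference(f, x, delta, r, rng)
        total = [t + gj for t, gj in zip(total, g)]
    return [t / replications for t in total]


def abs_value(x):
    """Nonsmooth test function F(x) = |x_1|."""
    return abs(x[0])


if __name__ == "__main__":
    print(mean_estimate(abs_value, [0.0], delta=0.01, r=0.5))  # at the kink
    print(mean_estimate(abs_value, [1.0], delta=0.01, r=0.5))  # away from the kink
```

At the kink the individual estimates are close to $+1$ or $-1$, but their average is near zero, which is the gradient of the smoothed function there; away from the kink smoothing changes nothing.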
Smoothing also has beneficial side effects in that it improves the behavior of
the deterministic function $F(x)$. In the case where $F(x)$ may be written as
the sum of two functions, one with a distinct global minimum and the other
with highly oscillatory behavior, smoothing may help to overcome the influence
of the oscillations, which may otherwise lead the process to local minima far
from the global one. Thus it can sometimes be useful to smooth the objective
function even if we can obtain a gradient $f_x(x,\omega)$. In this case we should
take a large value for the smoothing parameter $r_s$ on the first few iterations,
decreasing it as we approach the optimal point. The points at which $r_s$ should
be decreased may be determined using the values of additional estimates, such
as those described below in Section 16.3 or given by (16.9). Everything said
about the choice of the finite-difference parameter $\delta_s$ is also valid for the choice
of the smoothing parameter, including the connection between the step size
and the smoothing parameter and the possibility of interactive control of $r_s$.
The only difference is that a decrease in $r_s$ does not lead to an increase in the
variance of $\xi^s$ and that it is preferable to have $\delta_s < r_s$. This is also reflected in
the stochastic optimization software developed at IIASA.

All of the methods discussed so far use only the information available at
the current point or in its immediate vicinity. We shall now discuss some
more general ways of choosing the step direction which take into account the
information obtained at previous points.


16.2.5 Averaging over preceding iterations 


The definition of a stochastic quasigradient given in (16.3) allows us to use
information obtained at previous points as the iterations proceed; this
information may sometimes lead to faster convergence to the vicinity of the optimal
point. One possible way of using such information is to average the stochastic
quasigradients obtained in preceding iterations via a procedure such as (16.9).
The $v^s$ obtained in this way may then be used in method (16.2). This is
another way of smoothing out randomness and neutralizing such characteristics
of deterministic behavior as curved valleys and oscillations. Methods of this
type may be viewed as stochastic analogues of conjugate gradient methods and
were first proposed in [13]. We can choose $\xi^s$ according to any of (16.5), (16.6),
(16.10), (16.11), (16.12), or (16.14). Since $v^s - F_x(x^s) \to 0$ under rather general
conditions (see [1], [2]), method (16.9) can be considered as an alternative to
method (16.6) for deriving precise estimates of gradient $F_x(x)$. This method
has an advantage over (16.6) in that it provides a natural way of using rough
estimates of $F_x(x^s)$ on the first few iterations and then gradually increasing the
accuracy as the current point approaches the optimal point. In this case (16.9)
can be incorporated in the adaptive procedures used to choose the smoothing
parameter and the step in the finite-difference approximation.

However, it is not necessary to always take $a_s \to 0$, because we have
convergence for any $0 < a_s \le 1$. Sometimes it is even advantageous to take
$a_s = a = \text{constant}$, because in this case more emphasis is placed on
information obtained in recent iterations. In general, the greater the randomness, the
smaller the value of $a$ that should be taken. Another averaging technique is
given by


\[ v^{s+1} = \frac{1}{M_s} \sum_{i=s-M_s+1}^{s} \xi^i, \tag{16.15} \]

where $M_s$ is the size of the memory, which may be fixed.
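
The two averaging rules, (16.9) with a constant $a_s = a$ and the fixed-memory average (16.15), can be sketched as follows. The class and function names are hypothetical; while fewer than `memory` quasigradients have been stored, the window average is taken over those available, which is an assumption of this sketch.

```python
from collections import deque


def exponential_average(v, xi, a):
    """One step of (16.9): v^{s+1} = (1 - a) * v^s + a * xi^s."""
    return [(1.0 - a) * vj + a * gj for vj, gj in zip(v, xi)]


class WindowAverage:
    """Average (16.15) over the last `memory` stochastic quasigradients."""

    def __init__(self, memory):
        self.buffer = deque(maxlen=memory)  # the oldest entry is dropped automatically

    def update(self, xi):
        self.buffer.append(list(xi))
        n = len(self.buffer)
        return [sum(column) / n for column in zip(*self.buffer)]


if __name__ == "__main__":
    print(exponential_average([0.0, 2.0], [2.0, 4.0], 0.5))
    window = WindowAverage(memory=2)
    for g in ([1.0], [2.0], [3.0], [4.0]):
        print(window.update(g))
```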


16.2.6 Using second-order information 


There is strong evidence that in some cases setting

\[ v^s = A_s \xi^s \tag{16.16} \]


may bring about considerable improvements in performance. Here $\xi^s$ can be
chosen in any of the ways discussed above. Matrix $A_s$ should be positive
definite and take into account both the second-order behavior of function $F(x)$ and
the structure of the random part of the problem. One possible way of
obtaining second-order information is to use analogues of quasi-Newton methods to
update matrix $A_s$. To implement this approach, which was proposed by Wets
in [3], it is necessary to have $\|\xi^s - F_x(x^s)\| \to 0$.


16.3 Choice of Step Size


The simplest way of choosing the step-size sequence in (16.2) is to do it before
starting the iterative process. Convergence theory suggests that any series with
the properties:


\[ \rho_s > 0, \qquad \sum_{s=1}^{\infty} \rho_s = \infty, \qquad \sum_{s=1}^{\infty} \rho_s^2 < \infty \tag{16.17} \]


can be used as a sequence of step sizes. In addition, it may be necessary to take
into account relations between the step size and such things as the smoothing
parameter or the step in a finite-difference approximation. Relations of this
type have been briefly described in the preceding sections. In most cases the
choice $\rho_s \sim C/s$, which obviously satisfies (16.17), provides the best possible
asymptotic rate of convergence. However, since we are mainly concerned with
reaching the vicinity of the solution, rule (16.17) is of limited use because a
wide variety of sequences can be modified to satisfy it. The other disadvantage
of choosing the step-size sequence in advance is that this approach does not
make any use of the valuable information which accumulates during solution.
These "programmed" methods thus perform relatively badly in the majority of
cases.
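
As a brief worked check of (16.17), consider the family $\rho_s = C/s^{\alpha}$ with $C > 0$; the family itself is chosen here only for illustration.

```latex
% alpha = 1: both conditions of (16.17) hold
\sum_{s=1}^{\infty} \frac{C}{s} = \infty, \qquad
\sum_{s=1}^{\infty} \frac{C^2}{s^2} = \frac{C^2 \pi^2}{6} < \infty .

% alpha = 1/2: the steps are not square-summable, so (16.17) fails
\sum_{s=1}^{\infty} \Bigl(\frac{C}{\sqrt{s}}\Bigr)^{2} = C^2 \sum_{s=1}^{\infty} \frac{1}{s} = \infty .

% alpha = 2: the steps are summable, so the iterates cannot travel arbitrarily far
\sum_{s=1}^{\infty} \frac{C}{s^{2}} = \frac{C \pi^2}{6} < \infty .

% In general, rho_s = C / s^alpha satisfies (16.17) exactly when 1/2 < alpha <= 1.
```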

The best strategy therefore seems to be to choose the step size using an
interactive method. It is assumed that the user can monitor the progress of
the optimization process and can intervene to change the value of the step size
or other parameters. This decision should be based on the behavior of the
estimates $\bar{F}^s$ of the current value of the objective function. The estimates
may be very rough and are generally calculated using only one observation per
iteration, as in the following example:


\[ \bar{F}^s = \frac{1}{s} \sum_{i=1}^{s} f(x^i, \omega^i). \tag{16.18} \]


It appears that although the observations $f(x^s,\omega^s)$ may vary greatly, the $\bar{F}^s$
display much more regular behavior. Monitoring the behavior of some
components of the vector $x^s$ in addition to the $\bar{F}^s$ also seems to be useful. One
possible implementation of the interactive approach may proceed along the
following lines:

(i) The user first chooses the value of the step size and keeps it constant for a
number of iterations (usually 10-20). During this period the values of the
estimate $\bar{F}^s$ and some of the components of the vector $x^s$ are displayed,
possibly with some additional information.

(ii) The user decides on a new value for the step size using the available
information. Three different cases may occur:

- The current step size is too large. In this case both the values of the
estimate $\bar{F}^s$ and the values of the monitored components of $x^s$ exhibit
random jumps. It is necessary to decrease the step size.

- The current step size is just right. In this case the estimates decrease
steadily and some of the monitored components of the current vector
$x^s$ also exhibit regular behavior (steadily decrease or increase). This
means that the user may keep the step size constant until oscillations
occur in the estimate $\bar{F}^s$ and/or in the components of the current
vector $x^s$.

- The current step size is too small. In this case the estimate $\bar{F}^s$ will
begin to change slowly, or simply fluctuate, after the first few
iterations, while the change in $x^s$ is negligible. It is necessary to increase
the step size.


(iii) Continue with the iterations, periodically performing step (ii), until changes
in the step size no longer result in any distinct trend in either the function
estimate or the current vector $x^s$, which will oscillate around some point.
This will indicate that the current point is close to the solution.
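
The running estimate that the user monitors in this procedure can be maintained recursively, without storing past observations. In the sketch below the weight $1/s$ reproduces the plain average (16.18); the function names are hypothetical.

```python
def update_estimate(estimate, observation, gamma):
    """Recursive update of the function estimate with weight gamma."""
    return (1.0 - gamma) * estimate + gamma * observation


def running_estimates(observations):
    """Estimates (16.18): the plain average of the observations f(x^i, w^i) so far."""
    estimate, history = 0.0, []
    for s, obs in enumerate(observations, start=1):
        estimate = update_estimate(estimate, obs, 1.0 / s)
        history.append(estimate)
    return history


if __name__ == "__main__":
    print(running_estimates([2.0, 4.0, 6.0]))
```

Displaying the returned history every 10-20 iterations, together with a few components of the current point, gives the kind of information used in step (ii).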


This method of choosing the step size requires an experienced user, but we 
have found that the necessary skills are quickly developed by trial and error. 
The main reasons for adopting an interactive approach may be summarized as 
follows: 


- Interactive methods make the best use of the information which
accumulates during the optimization process.

- Because the precise value of the objective function is not available, it is
impossible to use the rules for changing the step size developed in
deterministic optimization (e.g., line searches).

- Stochastic effects make it extremely difficult to define formally when the
step size is "too big" or "too small"; theoretical research has not thrown
any light on this problem.


The main disadvantage of the interactive approach is that much of the
user's time is wasted if it takes the computer a long time to make one observation
$f(x^s,\omega^s)$. For this reason a great effort has been made to develop automatic
adaptive ways of choosing the step size, in which the value of the step size
is chosen on the basis of information obtained at all or some of the previous
points $x^i$, $i = 1,\ldots,s$. Methods of this type are considered in [14-20]. The approach
described in the following sections involves the estimate of some measures of
algorithm performance which we denote by $\Phi^i(\bar{x}^s, u^s)$, where $\bar{x}^s$ represents
the whole sequence $\{x^1, x^2, \ldots, x^s\}$ and $u^s$ the set of parameters used in the
estimate. In general, algorithm performance measures are attempts to formalize
the notions of "oscillatory behavior" and "regular behavior" used in interactive
step-size regulation, and possess one or more of the following properties:


- the algorithm performance measure is quite large when the algorithm
exhibits distinct regular behavior, i.e., when the estimates of the function
value decrease or the components of the current vector $x^s$ show a distinct
trend;

- the algorithm performance measure becomes small and even changes its
sign if the estimates of the current function value stop improving or if the
current point starts to oscillate chaotically;

- the algorithm performance measure is large far from the solution and small
in the immediate vicinity of the optimal point.


Automatic adaptive methods for choosing the step size begin with some
reasonably large value of the step size, which is kept constant as long as the value
of the algorithm performance measure remains high, and is then decreased when
the performance measure becomes less than some prescribed value. The
behavior of the algorithm usually becomes regular again after a decrease in the
step size, and the value of the performance measure increases; after a
number of iterations oscillations set in and the value of the performance measure
once again decreases. This is a sign that it is time to decrease the step size.
A rather general convergence result concerning such adaptive piecewise-linear
methods of changing the step size is given in [18]. However, in many cases it
is difficult to determine how close the current point is to the optimal point
using only one such measure; a more reliable decision can be made using several
of the measures described below. Unfortunately, it is not possible to come to
any general conclusions as to which performance measure is the "best" for all
stochastic optimization problems. Moreover, both the values of the parameters
used to estimate the performance measure and the value of the performance
measure at which the step size should be decreased are different for different
problems. Therefore if we fix these parameters once and for all we may achieve
the same poor performance as if we had chosen the whole sequence of step sizes
prior to the optimization process. Thus, it is necessary to tune the
parameters of automatic adaptive methods to different classes of problems, and the
interactive approach can be very useful here. An experienced user would have
little difficulty in using the values of the performance measures to determine
the correct points at which to change the step size, and in learning what type
of performance measure behavior requires an increase or a decrease in the step
size. The interactive approach is of particular use if one iteration is not very
time-consuming and there are a number of similar problems to be solved. In
this case the user can identify the most valuable measures of performance in the
first few runs, fix their parameters and incorporate this knowledge in automatic
adaptive step-size selection methods for the remaining problems.

Although interactive methods usually provide the quickest means of reach- 
ing the solution, they cannot always be implemented, and in this case automatic 
adaptive methods prove to be very useful. The stochastic optimization pack- 
age STO developed at IIASA and the Kiev stochastic and nondifferentiable 
optimization package NDO both give the user the choice between automatic 
adaptive methods and interactive methods of determining the step size. Below 
we describe some particular measures of algorithm performance and methods 
of choosing the step size. 

The main indicators used to evaluate the performance of an algorithm are 
estimates of such things as the value of the objective function and its gradient. 
The averaging procedure (16.9) may be used to estimate the value of the gra- 
dient, as described earlier in this paper. The main advantage of this procedure 
is that it allows us to obtain estimates of the mean values of the random vari- 
ables without extensive sampling at each iteration, since a very limited number 
of observations (usually only one) is made at each iteration. This estimate, 
although poor at the beginning, becomes more and more accurate as the iter- 
ations proceed. One example of such an estimate is (16.18), which is a special 
case of the more general formula 


$$F^{s+1} = (1 - \gamma_s) F^s + \gamma_s f(x^s, \omega^s). \tag{16.19}$$

Any observation $\varphi^s$ with the property

$$E(\varphi^s \mid x^1, x^2, \ldots, x^s) = F(x^s) + d_s \tag{16.20}$$

can be used instead of $f(x^s, \omega^s)$ in (16.19), where $d_s \to 0$. For example, (16.6)
would do. In order to get $\lim_{s \to \infty} |F^s - F(x^s)| = 0$ it is necessary to have
$\rho_s/\gamma_s \to 0$. However, estimate (16.18) assigns all observations of function values
the same weight. This sometimes leads to considerable bias in the estimate for
all the iterations the user can afford to run. Therefore for practical purposes it
is sometimes more useful to adopt procedures of the type described in Section
2 for the estimation of gradients. These include estimate (16.19) with fixed
$\gamma_s = \gamma$, where $\gamma \approx 0.01$-$0.05$, and the method in which the average is taken
over the preceding $M_s$ iterations:

$$F^{s+1} = \frac{1}{M_s} \sum_{i=s-M_s+1}^{s} f(x^i, \omega^i). \tag{16.21}$$

Although these estimates do not converge asymptotically to $F(x^s)$, they place
more emphasis on observations made at recent points. All of the estimates $F^s$
may also be used in an interactive mode to determine the step size, as described
above. In addition, the values of the parameters used to determine the step size
may also be chosen interactively. For example, the values of parameters $b_1$ and
$b_2$ in

$$\rho_s = \frac{b_1}{b_2 + s}$$

can be made to depend on the behavior of $F^s$.
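
The three kinds of function estimate discussed here (equal weights, a fixed averaging weight, and an average over a window of recent iterations) can be compared on a short list of observations. The following is only a minimal sketch: the function names, the default values of `gamma` and `window`, and the choice of the first observation as the starting estimate are assumptions made for illustration and are not taken from the STO or NDO packages.

```python
from collections import deque


def equal_weight_estimate(observations):
    """Running average giving every observation the same weight."""
    estimate, history = 0.0, []
    for k, f_obs in enumerate(observations, start=1):
        estimate += (f_obs - estimate) / k
        history.append(estimate)
    return history


def fixed_weight_estimate(observations, gamma=0.05):
    """Recursion F <- (1 - gamma) * F + gamma * f with a fixed gamma.

    Assumption: the starting estimate is the first observation.
    """
    estimate, history = observations[0], []
    for f_obs in observations:
        estimate = (1.0 - gamma) * estimate + gamma * f_obs
        history.append(estimate)
    return history


def window_estimate(observations, window=3):
    """Average over at most the last `window` observations."""
    recent, history = deque(maxlen=window), []
    for f_obs in observations:
        recent.append(f_obs)
        history.append(sum(recent) / len(recent))
    return history
```

On a steadily decreasing sequence of observations the equal-weight estimate lags furthest behind the most recent value, which is the bias the text refers to.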

We shall now describe some automatic adaptive rules for choosing the step
size. The important point as regards implementation is how to choose the initial
value of the step size $\rho_0$. We suggest that the value of a stochastic quasigradient
$\xi^0$ should first be computed at the initial point, and that the initial value of
the step size should then be chosen such that

$$k \rho_0 \|\xi^0\| \approx D,$$

where $k \approx 10$-$20$ and $D$ is a rough estimate of the size of the domain in which
we believe the optimal solution to be located. This means that it is possible to
reach the vicinity of each point in this domain within the first 20 iterations or
so.
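
As a rough illustration of this choice of initial step size, the sketch below reads the rule as "about 10 to 20 steps of the initial length should span the domain". The function name and the argument `k` (the assumed number of steps) are hypothetical, and the exact form of the relation is an interpretation of the text rather than a quotation of it.

```python
import math


def initial_step_size(xi0, domain_size, k=15):
    """Pick rho_0 so that k steps of length rho_0 * ||xi0|| cover domain_size.

    xi0         -- stochastic quasigradient observed at the initial point
    domain_size -- rough estimate D of the size of the region of interest
    k           -- assumed number of steps needed to cross the region (10-20)
    """
    norm = math.sqrt(sum(g * g for g in xi0))
    if norm == 0.0:
        raise ValueError("zero quasigradient: take another observation")
    return domain_size / (k * norm)
```

For example, with a quasigradient of norm 5, a domain of size 100 and `k = 20`, the sketch gives an initial step size of 1.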


16.3.1 Ratio of function estimate to the path length 


Before beginning the iterations we choose the initial step size $\rho_0$, two positive
constants $\alpha_1$ and $\alpha_2$, a sequence $M_s$ and an integer $M$. After every $M$ iterations
we revise the value of the step size in the following way:

(i) Compute the quantity

$$\Phi^1(\bar{x}^s, u^s) = \frac{F^{s-M_s} - F^s}{q(s, M_s)}. \tag{16.22}$$

Here the $u^s$ are the averaging parameters used in the estimation of both
$F^s$ and $M_s$, while $\bar{x}^s$ is again the whole sequence of points preceding $x^s$.
The quantity

$$q(s, M_s) = \sum_{i=s-M_s}^{s-1} \|x^{i+1} - x^i\| \tag{16.23}$$

is the length of the path taken by the algorithm during the preceding $M_s$
iterations. The function $\Phi^1(\bar{x}^s, u^s)$ is another example of a measure which
can be used to assess algorithm performance.

(ii) Take a new value of the step size:

$$\rho_{s+1} = \begin{cases} \alpha_1 \rho_s & \text{if } \Phi^1(\bar{x}^s, u^s) \le \alpha_2, \\ \rho_s & \text{otherwise.} \end{cases} \tag{16.24}$$

In this method the step size is changed at most once every $M$ iterations.
This is essential because function $\Phi^1$ changes slowly, and if its value is
less than $\alpha_2$ at iteration number $s$ it is likely that the same will be true
at iteration number $s + 1$. Therefore $M$ should lie in the range 5-20.
This procedure can be modified in various ways, such as continuing for
$M$ iterations with a fixed step size, then starting to compare values until
inequality (16.24) is satisfied, whereupon the step size is reduced. We then
wait another $M$ iterations and repeat the procedure. Recommended values
of $\alpha_1$ and $\alpha_2$ lie within the ranges 0.5-0.9 and 0.005-0.1, respectively. The
number $M_s$ may be chosen to be constant and equal to $M$. If we have a
number of similar problems it is very useful to make the first run in a
semi-automatic mode, i.e., to intervene in the optimization process to improve
the values of parameters $\alpha_1$, $\alpha_2$, $M$; the new values can then be used in a
fully automatic mode to solve the remaining problems.
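
The two steps of this rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code of either package: the names `path_length`, `performance_ratio` and `update_step_size` are hypothetical, the default constants are simply taken from inside the recommended ranges, and returning 0 when the path length is zero is a convention chosen here.

```python
import math


def path_length(points):
    """Length of the path through consecutive points, as in (16.23)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def performance_ratio(f_old, f_new, points):
    """Decrease of the function estimate divided by the path length, as in (16.22).

    f_old, f_new -- function estimates at the start and end of the window
    points       -- the points visited during the window, in order
    """
    q = path_length(points)
    # Assumption: a window with no movement is treated as "no progress".
    return (f_old - f_new) / q if q > 0.0 else 0.0


def update_step_size(rho, measure, alpha1=0.7, alpha2=0.01):
    """Shrink the step size when the measure is small, as in (16.24)."""
    return alpha1 * rho if measure <= alpha2 else rho
```

In use, `update_step_size` would be called only once every M iterations, with the window of points and function estimates collected since the previous call.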


This algorithm is by no means convergent in the traditional sense, but it 
outperformed traditional choices like C/s in numerical experiments because it 
normally reaches the vicinity of the optimal point more quickly. However, it is 
possible to safeguard convergence by considering a second sequence $C'/s$, where
$C'$ is small, and switching to this sequence if the step size recommended by
(16.24) falls below a certain value. This step size regulation was introduced in 
[15]. 


16.3.2 Use of gradient estimates

Take $\Phi^2(\bar{x}^s, u^s) = \|G^s\|$ instead of $\Phi^1(\bar{x}^s, u^s)$ in (16.24), where $G^s$ is one of the
gradient estimates discussed above, and the $u^s$ represent all the parameters used,
including averaging parameters and the frequency of changes in the step size.


16.3.3 Ratio of progress and path


The quantity $\|x^{s-M_s} - x^s\|$ represents the progress made by the algorithm
between iteration number $s - M_s$ and iteration number $s$. If we keep the step
size constant, the algorithm begins to oscillate chaotically after reaching some
neighborhood of the optimal point. The smaller the value of the step size, the
smaller the neighborhood at which this occurs, and thus the total path between
iterations $s$ and $s - M_s$ begins to grow compared with the distance between
points $x^{s-M_s}$ and $x^s$. This means that the function

$$\Phi^3(\bar{x}^s, u^s) = \frac{\|x^{s-M_s} - x^s\|}{\sum_{i=s-M_s}^{s-1} \|x^{i+1} - x^i\|} \tag{16.25}$$

can be used as a performance measure in equation (16.24).
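
The ratio of progress to path is easy to compute from the stored points of the last window. The sketch below is illustrative only; the function name is hypothetical and the value returned for a window with no movement is an assumption.

```python
import math


def progress_to_path_ratio(points):
    """Distance between first and last point divided by the path length.

    points -- the points visited during the window, in order.
    The ratio is 1 for motion along a straight line and falls towards 0
    when the iterates oscillate around one place.
    """
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path == 0.0:
        return 0.0  # assumption: no movement counts as no progress
    return math.dist(points[0], points[-1]) / path
```

A run of points along a line gives a ratio of 1, whereas three unit steps back and forth between two points give a ratio of 1/3, which would signal that the step size should be reduced.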
16.3.4 Analogues of line search techniques

The decision as to whether (and how) to change the step size may be based
on the values of the scalar product of adjacent step directions. If we have
$(\xi^{s-1}, \xi^s) > 0$, then this may be a sign that regular behavior prevails over
stochastic behavior, the function is decreasing in the step direction and the
step size should be increased. Due to stochastic effects the function will very
often increase rather than decrease, but in the long run the number of bad
choices will be less than the number of correct decisions. Analogously, if this
inequality does not hold then the step size should be decreased. The rule for
changing the step size is thus basically as follows:

$$\rho_{s+1} = \begin{cases} \rho_s & \text{if } -\alpha_1 \le (\xi^{s-1}, \xi^s) \le \alpha_1, \\ \alpha_2 \rho_s & \text{if } (\xi^{s-1}, \xi^s) > \alpha_1, \\ \alpha_3 \rho_s & \text{if } (\xi^{s-1}, \xi^s) < -\alpha_1, \end{cases} \tag{16.26}$$

where the values of $\alpha_1$, $\alpha_2$, $\alpha_3$ (recommended values $\alpha_1 \approx 0.4$-$0.8$, $1 < \alpha_2 < 1.3$
and $0.7 < \alpha_3 < 1$) should be chosen before starting the iterations. It is also
advisable to have upper and lower bounds on the step size to avoid divergence.
Sometimes it is convenient to normalize the vectors of step directions, i.e.,
$\|\xi^s\| = 1$. The lower bound may decrease as the iterations proceed. This
method may also be applied to the choice of a vector step size, treating some (or
all) variables or groups of variables separately. A number of different methods
based on the use of scalar products of adjacent step directions to control the
step size have been developed by Uriasiev [19], Pflug [16], and Ruszczynski and
Syski [20].
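
A rule of this kind, together with the upper and lower bounds recommended in the text, might be sketched as follows. The function name, the default constants (picked from inside the recommended ranges) and the bound values `rho_min` and `rho_max` are assumptions for illustration; the step directions are assumed to be normalized already.

```python
def scalar_product_update(rho, xi_prev, xi_curr,
                          a1=0.6, a2=1.2, a3=0.8,
                          rho_min=1e-6, rho_max=1e3):
    """Adjust the step size from the scalar product of adjacent directions.

    rho              -- current step size
    xi_prev, xi_curr -- previous and current (normalized) step directions
    a1               -- half-width of the band in which rho is left unchanged
    a2 > 1, a3 < 1   -- factors used to increase and decrease rho
    """
    product = sum(p * c for p, c in zip(xi_prev, xi_curr))
    if product > a1:
        rho *= a2      # directions agree: lengthen the step
    elif product < -a1:
        rho *= a3      # directions oppose each other: shorten the step
    # Keep the step size inside fixed bounds to avoid divergence.
    return min(max(rho, rho_min), rho_max)
```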


16.4 IIASA Implementation


The interactive stochastic optimization package implemented at IIASA (STO) 
is based on the same ideas as the package for stochastic and nondifferentiable 
optimization developed in Kiev (NDO). It allows the user to choose between 
interactive and automatic modes and makes available the stochastic quasigradi- 
ent methods described in Sections 2 and 3. In the interactive mode the program 
offers the user the opportunity to change the step parameters and the methods 
by which the step size and step direction are chosen during the course of the
iterations. The user can also stop the iterative process and obtain a more precise
estimate of the value of the objective function before continuing. The package 
is written in FORTRAN-77. 
Before initiating the optimization process the user has to: 


(i) Provide a subroutine UF which calculates the value of function $f(x,\omega)$
for fixed $x$ and $\omega$ and, optionally, a subroutine UG which computes the
gradient $f_x(x,\omega)$ of this function; the function evaluation subroutine should
be of the form:

      FUNCTION UF(N,X)
      DIMENSION X(N)
C     Calculation of f(x,w); the result is assigned to UF
      RETURN
      END

Here N is the dimension of the vector of variables X. (Note that the
implementation on the IIASA VAX actually requires the subroutine to be
entered in lower-case letters rather than capitals.) A description of a
subroutine which calculates a quasigradient is given later in this paper.
(ii) Compile these subroutines with the source code to obtain an executable 
module. 
(iii) Provide at least one of the following additional data files: 
— algorithm control file (used only in the noninteractive option) 
— parameter file (used only in the interactive option) 
— initial data file (should always be present) 


All of these files are described in some detail later in the paper. 


The optimization process can then begin. The program first asks the user 
a series of questions regarding the required mode (interactive or automatic), 
method of step size regulation, choice of step direction, etc. These questions 
appear on the monitor and should be answered from the keyboard or by refer- 
ence to a data file. We shall represent the dialogue as follows: 


Question?    Answer

The first question is

interactive mode? reply yes or no    yes/no


To choose the interactive option the user should type in yes (or y); to select the 
automatic option he should answer no (or n). In the latter case the program 
would ask no further questions, but would read all the necessary information 
from the algorithm control file (which is usually numbered 2; under UNIX
conventions its name is fort.2). The iterative process would then begin, terminating
after 10,000 iterations if no other stopping criterion is fulfilled. The algorithm 
control file must contain answers to all of the following questions except those 
concerned either with dialogue during the iterations or with the parameter file 
(such questions are marked with an asterisk * below). This file is given a name
only for ease of reference; the important thing for the user is its number.

Assume now that the user has chosen the interactive option by answering 
yes to the first question. The program then asks 


parameter file? (number) *


The user should respond either with the number of the file of default parameters 
or with the number of the file in which the current values of the algorithm 
parameters are stored. The file of default parameters is provided with the 
program and has the name fort.12 (under UNIX conventions); thus, to refer 
the program to the default file the user should answer 12. The purpose of this 
file is to help the user to set the values of algorithm parameters in the ensuing 
dialogue and also to store such improved values as may be discovered by the 
user through trial and error. If the user assigns the algorithm parameters any 
values other than those in the default file, the new values become the default 
values in subsequent runs of the program. This file is optional. 
The program then asks

read parameter file? reply yes or no    yes/no *


The answer yes implies that the file specified in the previous question 
exists, and that default parameter values are stored in this file. In this case, 
when asking the user about parameter values, the program will read the default 
option in the parameter file and reproduce it on the screen together with the 
question. If the user accepts this default value he should respond with 0 (zero);
otherwise he should enter his own value, which will become the new default 
value. 

The answer no means that no default values are available at the moment. 
In this case the program will form a new default file (labeled with the number 
given as an answer to the previous question); its contents will be based on the 
user’s answers to future questions. This new default file, once formed, can be 
used in subsequent runs. 

The next question is 


number of variables? (number) 


to which the user should respond with the dimension of the vector of variables 
x. He is then asked 


initial data file? (number) 


and should reply with the number of the initial data file. This file should contain 
the following elements (in exactly this order): 


— The initial point, which should be a sequence of numbers separated by 
commas or other delimiters. 

— Any additional data required by subroutines UF or UG if such data exists 
and the user chooses to put it in the initial data file (optional). 

— Information about the constraints (described in more detail below) 


The program then asks 
step size regulation?    is

Here is denotes a positive integer from the set {1,2,3,4,6,7}, where the different
values of is correspond to different ways of choosing the step size. (The integer
5 is reserved for an option currently under development.)


is  Definition


1  Adaptive automatic step size regulation (16.24) based on algorithm
   performance function (16.22) and function estimate (16.18).

2  Manual step size regulation based on algorithm performance function (16.22)
   and function estimate (16.18).

3  Adaptive automatic step size regulation (16.24) using algorithm
   performance measure (16.22) and a function estimate based on a finite number
   of previous observations (16.21).

4  Manual step size regulation based on the same estimates of algorithm
   performance as for is = 3.

6  Automatic step size regulation (16.24) using algorithm performance measure
   (16.22) and function estimate (16.19) with fixed $\gamma_s$.

7  Manual step size regulation based on the same estimates of algorithm
   performance as for is = 6.


The difference between adaptive automatic and manual step size regulation
(see is = 1,2) is that in the first case the step size is chosen automatically,
although the user may terminate the iterations at specified points and continue 
with another step size regulation, while in the second case the user changes the 
value of the step size himself. Both step size regulations are based on the same 
estimates of function value and algorithm performance. 

The next question is 

step direction? (5 figures)    id1 id2 id3 id4 id5

The user has to respond with five figures which specify various ways of choosing
the step direction, e.g., 11111. We shall refer to these figures as id1, id2, id3,
id4 and id5. The subroutine which estimates the step direction makes some
number of initial observations $\bar{\xi}^{i,s}$ at each step; these are then averaged in
some way to obtain the vector $\xi^s$, and the final step direction $v^s$ is calculated
using both $\xi^s$ and values of $v^i$ for $i < s$.

The value of id1 specifies the nature of the initial observations $\bar{\xi}^{i,s}$.


id1  Definition


1  A direct observation of a stochastic quasigradient is available for $\bar{\xi}^{i,s}$, and
   the user has to specify a subroutine UG to calculate it:

      SUBROUTINE UG(N,X,G)
      DIMENSION X(N),G(N)
C     Calculation of a stochastic quasigradient
      RETURN
      END

   where G(N) is an observation of a stochastic quasigradient.

2  Central finite-difference approximation of the gradient as in (16.11).

3  The $\bar{\xi}^{i,s}$ are calculated using random search techniques (16.12).

4  Forward finite-difference approximation of the initial observations $\bar{\xi}^{i,s}$ as
   in (16.10).

5  Central finite-difference approximation of the gradient as in (16.11). All
   observations of the function used in one observation of $\bar{\xi}^{i,s}$ are made with
   the same values of random parameters $\omega$.

6  The $\bar{\xi}^{i,s}$ are calculated using random search techniques (16.12). All
   observations of the function used in one observation of $\bar{\xi}^{i,s}$ are made with
   the same values of random parameters $\omega$.

7  Forward finite-difference approximation of the initial observations $\bar{\xi}^{i,s}$ as
   in (16.10). All observations of the function used in one observation of $\bar{\xi}^{i,s}$
   are made with the same values of random parameters $\omega$.

Note that for id1 = 5,6,7 all observations of the function used in one observation
of $\bar{\xi}^{i,s}$ are made with the same values of random parameters $\omega$. In this case
the user should write a function UF which supports this feature as follows:

      FUNCTION UF(N,X)
      DIMENSION X(N)
      COMMON /OMEG/LO,MO
C     If LO=1 and MO=1 then obtain new values
C     of random factors w and set MO=0. Make
C     an observation of the function at point x.
      RETURN
      END
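
The idea behind the options that reuse the same random parameters can be illustrated with a central finite-difference estimate in which a single draw of the random factor is shared by all function evaluations. The sketch below is not a transcription of (16.11) or of the package code: it uses the standard central difference, passes the random factor to the function as an explicit argument instead of using the flags LO and MO, and draws a single uniform random number, all of which are assumptions made for illustration.

```python
import random


def central_difference_crn(f, x, delta, rng):
    """One gradient observation using common random numbers.

    f     -- function f(x, omega) returning one noisy observation
    x     -- current point (list of floats)
    delta -- finite-difference step
    rng   -- random.Random instance supplying the random factor
    """
    omega = rng.random()  # drawn once and reused for every evaluation below
    gradient = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += delta
        x_minus[i] -= delta
        gradient.append((f(x_plus, omega) - f(x_minus, omega)) / (2.0 * delta))
    return gradient
```

Because both evaluations in each difference see the same random factor, noise that does not depend on the point cancels in the difference instead of being divided by the small step `delta`.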


The second figure id2 determines the point at which observations are made:

id2  Definition

1  The initial direction is calculated at the current point $x^s$.
2  The initial direction is calculated at a point chosen randomly from among
   those in the neighborhood of the current point $x^s$.


The value of id3 defines the way in which the step in a finite-difference or
random search approximation of $\bar{\xi}^{i,s}$ is chosen:

id3  Definition

1  The approximation step is fixed. The observations of the objective function
   at point $x^s$ originally used to obtain gradient observations $\bar{\xi}^{i,s}$ are not used
   to update the estimate of the function employed for step size regulation.

2  The ratio $\delta_s/\rho_s$ of the step in the finite-difference approximation to the
   step size of the algorithm is fixed (see (16.10)-(16.12)). The observations
   of the objective function at point $x^s$ originally used to obtain gradient
   observations $\bar{\xi}^{i,s}$ are not used to update the estimate of the function
   employed for step size regulation.

3  The approximation step is fixed. The observations described for id3 = 1,2
   above are used to update the current estimate of the objective function.

4  The ratio $\delta_s/\rho_s$ of the step in the finite-difference approximation to the
   step size of the algorithm is fixed (see (16.10)-(16.12)). The observations
   described for id3 = 1,2 above are used to update the current estimate of
   the objective function.


The fourth figure id4 defines the type of averaging used to obtain $\xi^s$ from
observations $\bar{\xi}^{i,s}$.

id4  Definition

1  No averaging, $\xi^s = \bar{\xi}^{i,s}$, $i = 1$.

2  Number of samples > 1.

The value of id5 specifies the way in which the final step direction $v^s$ is obtained
from previous values of $v^i$ and from $\xi^s$.

id5  Definition

1  No previous information is used. The final vector $v^s$ is simply set equal to
   $\xi^s$.

2  (16.9) is used.

3  A positive number $n_g$ is provided by the user. Set $k(s) = \max\{k : k n_g + 1 \le s\}$.
   Then the final direction $v^s$ is computed from (16.15), where
   $M_s = s - k(s) n_g + 1$.

4  No previous information is used. The final vector $v^s$ is set equal to $\xi^s$ and
   is normalized.

5  (16.9) is used. The final vector $v^s$ is normalized.

6  A positive number $n_g$ is provided by the user. Let $k(s) = \max\{k : k n_g + 1 \le s\}$.
   Then the final direction $v^s$ is computed from (16.15), where
   $M_s = s - k(s) n_g + 1$. The final vector $v^s$ is normalized.


The program then asks about the type of constraints present in the problem: 
constraints? (number) 


The answer (in the present implementation) must be 1,2,3 or 4. These values 
define the type of constraints present and correspond to the following options: 


1  There are no constraints at all.

2  There are upper and lower bounds on the variables. The values of these
   bounds should be given at the end of the initial data file in the form of
   strings of numbers separated by commas or other delimiters. The string
   containing the upper bounds should come first.

3  There is one constraint $\sum_{i=1}^{n} a_i x_i \le b$. The coefficients $a_i$ should be given
   at the end of the initial data file. The string containing the coefficients of
   the linear form comes first and then, on a separate line, the right-hand side.

4  There are general linear constraints $b_l \le Ax \le b_u$. In this case the program
   computes a projection on these constraints at each iteration, using the
   quadratic programming package SOL/QPSOL [21]. The previous point
   $x^{s-1}$ is used as the initial approximation to the solution at iteration number
   $s$. The precision of projection also varies, being rough during the first few
   iterations and improving as the process proceeds. All of these facilities are
   intended to reduce the amount of computation required at each iteration.
   The following information should appear at the end of the initial data file
   (in exactly this order):

   • upper bounds on variables $x$
   • lower bounds on variables $x$
   • upper bounds $b_u$ on general linear constraints
   • lower bounds $b_l$ on general linear constraints
   • number of nonzero elements in matrix $A$
   • numbers of nonzero elements in the columns of matrix $A$
   • nonzero elements of matrix $A$ in increasing order of column number
   • row numbers of nonzero elements, in the same order as the elements
     themselves
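
The last four items of this list describe the constraint matrix column by column. As an aid to preparing the initial data file, the following sketch derives those four items from a dense matrix. It is a hypothetical helper, not part of the package; in particular, the use of 1-based row numbers (following FORTRAN conventions) is an assumption, and the exact delimiters and line layout expected by the program are not covered.

```python
def column_description(matrix):
    """Describe a dense matrix column by column.

    matrix -- list of rows, each a list of numbers
    Returns (number of nonzeros, nonzeros per column,
             nonzero values in column order, their row numbers).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    per_column, values, row_numbers = [], [], []
    for j in range(n_cols):
        count = 0
        for i in range(n_rows):
            if matrix[i][j] != 0:
                values.append(matrix[i][j])
                row_numbers.append(i + 1)  # assumption: rows are numbered from 1
                count += 1
        per_column.append(count)
    return len(values), per_column, values, row_numbers
```

For the matrix with rows (1, 0, 2) and (0, 3, 0) the sketch gives 3 nonzero elements, one per column, with values 1, 3, 2 and row numbers 1, 2, 1.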


The next question is 


termination condition? (number)

There is currently only one possible answer, which is 1. This means that the
iterations terminate when the step size becomes smaller than some value specified
by the user. Additional options are under development.

The program then asks the user whether the interactive mode is required 
during the iterations: 


interactive mode during iterations? reply yes or no yes/no* 


Note that the answer to this question should not be included in the algorithm 
control file for the completely noninteractive option (as indicated by the aster- 
isk). If the user replies yes (or y), the program will allow the user to change the 
parameters of the algorithm and even the algorithm itself during the course of 
the iterations. If the answer is no (or n) the program will not communicate with 
the user during the iterations but will instead ask the following two questions: 


number of iterations? (number) 


This is the number of iterations that should be performed before the process 
terminates (if it has not already been terminated by some other condition). It 
is necessary to put an answer to this question in the algorithm control file for 
the completely non-interactive option. 


extra output? reply yes or no    yes/no


This is the program’s way of asking the user whether information about the
iterations should be saved. Note that these two questions do not appear if
the user has chosen to run the program in the interactive mode during the 
iterations. 

Now comes a group of questions about step direction parameters. These
questions depend on the values of id1, id2, id3, id4 and id5 given previously
(see the discussion of answers to the question step direction?).

If id1 = 4,5 then the question

number of random directions? (number)

appears. The required answer is $M_s$ from (16.12).
If id2 = 2 the user is asked

relation between step size and neighborhood? (number)

The answer is the ratio of the step size to the size of the neighborhood (of the
current point) from which the observation point is chosen (i.e., $r_s/\rho_s$ in the
discussion of (16.13)).

If id3 = 1,3 and id1 ≠ 1 the program asks

step in finite difference approximation? (number)

The required answer is the value of step $\delta_s$ in the finite-difference or random
search approximation (16.10)-(16.12) of the gradient observation. In this case
$\delta_s$ is fixed. However, if id3 = 2,4 the question

relation between step in finite difference
approximation and step size? (number)

appears. The answer is the ratio $\delta_s/\rho_s$ of the finite-difference approximation
step to the algorithm step size.




If id4 = 2 the program asks

number of samples? (number)

This is the number of samples taken at one point to obtain the averaged estimate
(see, for instance, N in (16.6)).
The question

discount rate? (number)

appears if id5 = 2,5. The required answer is the (fixed) value of $\alpha_s$ from (16.9).
However, if id5 = 3,6 the program asks

number of averaging steps? (number)

The user should respond with the value of $n_g$ (see earlier discussion of id5
options).

We now have a group of questions concerning the values of step size parameters.
Which questions appear depends on the way in which the step size
is being chosen (see earlier discussion of the question step size regulation?).

If the user has chosen automatic step size regulation (is = 1,3,6) he will
be asked the following four questions:

initial step size? (number)

This is $\rho_0$.

multiplier? (number)

The required answer is $\alpha_1$ from (16.24).

frequency of step size changes? (number)

The user should give the value of $M$ (see discussion of (16.24)).

lower bound on function decrease? (number)

This is $\alpha_2$ from (16.24).

However, if the user has chosen to regulate the step size interactively (is =
2,4,7) he will only be asked

value of step size? (number)

The following questions appear only if there are general linear constraints, i.e.,
if the answer to the question constraints? is 4:

number of general linear constraints? (number)

correspondence between step size and
accuracy of projection? (number)

The answer to the first question is obvious but the second requires some
explanation. In order to keep the amount of computation to a minimum, the accuracy
$\tau_s$ of projection is linked to the value of the step size: $\tau_s = c\rho_s$. This leads to
only rough projection during the first few iterations (when the step size is large)
and more precise projection as the current point approaches the optimal point.
The required answer to the last question is the value of $c$; recommended values
lie in the range 0-1.

Another group of questions is concerned with the estimates of the objective 
function and also affects the choice of step size:

size of memory? (number)


The answer is $M_s$ from (16.22), which in this implementation is fixed. If the
step size regulation is defined by is = 6,7 the program asks


multiplier for function averaging? (number)


The user should give the value of $\gamma_s$ in (16.19), which is fixed.

With the answers to these questions the algorithm control file for the non- 
interactive option is complete. The rest of this section describes the ways in 
which the algorithm parameters and the algorithm itself may be modified dur- 
ing the course of the iterations. This may be done only if the answer to the 
question “interactive mode during iterations? reply yes or no” was yes. In this 
case the program will now perform the first iteration and produce a string of 
information something like this: 


1 0. 7505.826 7505.826 0. 1.000 100.458 109.575 


Here the first number is the number of the current iteration, the second is the
value of some algorithm performance measure (see (16.22), (16.25) for examples
of such functions), the third is the estimate of the value of the objective
function at the current point (see (16.18), (16.19), (16.21) for examples of such
estimates), the fourth is an observation of $f(x^s, \omega^s)$, the fifth currently has no
meaning and always contains 0, the sixth is the step size, and the rest are values
of variables $x^s$ (the default is that only the values of the first two such variables
are displayed). After this string the following question will appear:


continue? reply “space” ,step,dir,var,estim,go,yes or no : 


This gives the user the opportunity to continue without any change, to alter the 
frequency of communication, to change the step size or step direction parame- 
ters, to display variables other than the first two, to stop at the current point 
and obtain a precise estimate of the value of the objective function, to switch 
from interactive to automatic mode, or to terminate the iterations and continue 
the solution with another algorithm. We shall now describe all of these options 
in some detail. 
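
For users who keep a log of the session, the information string printed after each iteration can be split into the fields just listed. The sketch below is a hypothetical post-processing helper and not part of the package; the field names are chosen here for illustration, and it assumes the fields are separated by blanks as in the sample string.

```python
def parse_information_string(line):
    """Split one information string into named fields.

    Assumed field order: iteration number, performance measure, function
    estimate, function observation, unused field, step size, then the
    displayed variables.
    """
    fields = [float(token) for token in line.split()]
    return {
        "iteration": int(fields[0]),
        "performance": fields[1],
        "estimate": fields[2],
        "observation": fields[3],
        "unused": fields[4],
        "step_size": fields[5],
        "variables": fields[6:],
    }
```

Applied to the sample string shown earlier, this yields iteration 1, a step size of 1.0 and the two displayed variables 100.458 and 109.575.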


“space” If the user hits the space bar nothing will change and the program 
will perform another 10 iterations. The information about the process is 
displayed after each iteration; after the 10-th iteration the user is once again 
given the opportunity to make changes (the question “continue? reply 
“space” step...” appears). 

step This means that the user wants to change the step size parameters 
(but not the step size regulation itself) and all the related questions will 
be repeated. Default or previous values of the step parameters will appear 
on the screen together with the questions. 


dir This means that the user wants to change the step direction parameters 
(but not the way in which the step direction is chosen) and the questions 
concerned with this will be repeated. Default or previous values of the 
direction parameters will appear on the screen together with the questions.

var In this case the quantity and/or the selection of variables displayed on
the screen may be changed. The following questions will appear:

number of printed variables? (number) *

i.e., if the user wants to print out the values of four variables rather than
the default two, he answers 4.

printed variables? (number, number,....) *

Here the user specifies which particular variables he wants displayed by
giving the numbers of the chosen variables separated by commas. Questions
concerning the frequency of communication will also appear here (see
description of response yes below).

estim In this case the program will stop at the current point and estimate
the value of the objective function. The following questions will appear:

number of observations? (number) *

i.e., the number of observations to be made, and

message frequency? (number) *

i.e., the number of observations after which the current estimate is displayed.
The user is also asked for the point at which the estimate should be made:

what point? reply current, new or exit    current/new/exit *

If the answer is new the program asks the question:

where to find new point? reply screen or file    screen/file *

If the user wants to enter the new point from the keyboard he should
reply screen (or s). He should then type the desired point on a new line,
separating the components by commas. If, however, the new point is stored
in some file the response should be file (or f) and the user is then asked

file number? (number) *

The answer is obviously the number of the file containing the new point.
This new point is taken as the starting point for future iterations if the
user answers yes to the following question:

replace current point by new? reply yes or no    yes/no *

which appears when the estimation of the objective function at the new
point has been completed. This facility makes it possible to exchange
the current point for an arbitrary point chosen by the user and also to
make precise estimations at arbitrary points. Finally, if the answer to the
question “what point? reply current, new or exit” is exit the estimation
procedure will end and the iterations will continue.

go This means that the user does not want to continue in the interactive
mode; he wants the process to proceed automatically. This is useful once the
algorithm parameters have been established and also in the case when one
iteration is very time-consuming. The user is then asked


number of iterations? (number} - 
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ie., the total number of iterations before termination. After this the pro- 
gram has no more communication with the user and terminates after the 
specified number of iterations. yes In this case the frequency of communi- 
cation can be changed. The following questions appear: 


output frequency? (number) 

This is the number of iterations after which information about the process 
is displayed on the screen (the default value is 1, i.e., a string of information 
is printed after every iteration). 

dialogue frequency? (number) 

This is the number of process information strings (see above) printed before 
the user is asked the question “continue? reply space,step,dir,var,estim,yes 
or no”. The default is 10, i.e., the user is given ten strings of information 
about the process before he is asked whether he wishes to make any 
changes. 

no This means that the user wishes either to terminate the iterations or 
change the method. The program asks: 

continue? reply “space”,yes or no “space”/yes/no 


Here hitting the space bar means that the user wishes to proceed with 
the iterations using the same method, maybe returning to the initial point 
(see below); yes means he wishes to change the way in which the step size 
and/or step direction are chosen (the program will ask further questions 
about this—see below); no means that he wishes to terminate the iterations 
completely (some self-explanatory questions will then appear). If the user 
answers “space” or yes the program will ask 


return to initial values? reply yes or no yes/no 


and the user should give the appropriate response. 
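
The interplay of the output frequency and the dialogue frequency described above can be sketched as follows. This is only an illustration of the counting logic, not the STO source code; the function name `dialogue_schedule` and its arguments are hypothetical.

```python
def dialogue_schedule(n_iterations, output_frequency=1, dialogue_frequency=10):
    """Return (printed, dialogues) for a run of n_iterations.

    printed   -- iterations after which a string of process information is shown
    dialogues -- iterations after which the "continue?" question would be asked

    Assumption: one information string is printed every `output_frequency`
    iterations, and the question is asked after every `dialogue_frequency`
    printed strings (defaults 1 and 10, as in the text).
    """
    printed, dialogues = [], []
    for it in range(1, n_iterations + 1):
        if it % output_frequency == 0:
            printed.append(it)
            if len(printed) % dialogue_frequency == 0:
                dialogues.append(it)
    return printed, dialogues


# With the defaults the user is consulted after iterations 10, 20, ...
printed_default, dialogues_default = dialogue_schedule(25)
```

With an output frequency of 2 and a dialogue frequency of 3, for example, the question would appear after iterations 6, 12, and so on.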


The very first appearance of the question “continue? reply space,step,dir, 
var,estim,yes or no” is followed by the question 


least value of step size? (number) 


The answer is the least permissible value of the step size. If the current step 
size is less than this value then the iterations will terminate. In other cases 
the process terminates after 10,000 iterations with a question about whether to 
continue or not. 

Everything that appears on the screen during the interactive dialogue au- 
tomatically also goes to file number 15 (fort.15 in UNIX). This makes it possible 
to study the process after it has terminated. 

This section provides some idea of the capabilities of the package of sto- 
chastic optimization subroutines STO available at IIASA. The implementation 
described here is the first version, and development of the second continues. 
This revised version will include methods for solving certain special problems, 
in particular problems with recourse, and new methods for step size regulation 
will be introduced. 

16.5 Some Numerical Experiments 


16.5.1 Facility location problem 


We first consider a simple model of facility location in a stochastic environment. 
Suppose that we have to determine the amounts $x_i$ of materials, facilities, etc., 
required at points $i = 1,\dots,n$ in order to meet a demand $w_i$. The demand is 
random, and all we know is its distribution function 
$P\{w_1 < \bar{w}_1,\dots,w_n < \bar{w}_n\} = H(\bar{w})$. The actual value 
$w = (w_1,\dots,w_n)$ of the demand is not known when the decision concerning 
$x = (x_1,\dots,x_n)$ has to be made. Assume that we have made a decision $x$ about 
the distribution of facilities and then found that the actual demand is $w$. We 
have to pay for both oversupply and shortfalls, i.e., the penalty charged at the 
$i$-th location is $\psi_i^1(w_i - x_i)$ if $w_i \ge x_i$ and $\psi_i^2(x_i - w_i)$ 
if $w_i < x_i$, where the functions $\psi_i^1(y)$ and $\psi_i^2(y)$ are 
nondecreasing. In the simplest case these functions are linear and the total 
penalty for fixed $x$ and $w$ is 
$\sum_{i=1}^{n} \max\{a_i(w_i - x_i),\, b_i(x_i - w_i)\}$, where $a_i \ge 0$, 
$b_i \ge 0$, $i = 1,\dots,n$. In most cases it is reasonable to select $x$ in such 
a way that the average penalty is at a minimum, i.e., to minimize the following 
function: 

$$F(x) = E_w f(x,w) = E_w \sum_{i=1}^{n} \max\{a_i(w_i - x_i),\, b_i(x_i - w_i)\} 
= \int \sum_{i=1}^{n} \max\{a_i(w_i - x_i),\, b_i(x_i - w_i)\}\, dH(w). \tag{16.27}$$


This approach can easily be generalized to deal with more complex facility 
location models (see [1],[15],[22]). The numerical experiment presented here 
is basically an application of the facility location model described above to the 
problem of high school location in Turin, Italy (see [15],[22]). In this example 
$n$ is the number of districts in the city (23 in this case), $w_i$ is the number 
of students who want to attend schools in district $i$, and $x_i$ is the capacity of 
schools in district $i$. It is assumed that a student living in district $i$ will 
choose a school in district $j$ with probability $p_{ij}$, where 

$$p_{ij} = \frac{e^{-\lambda c_{ij}}}{\sum_{l=1}^{n} e^{-\lambda c_{il}}}$$

and $c_{ij}$ is proportional to the distance between districts $i$ and $j$. The values 
of $c_{ij}$ are taken from [15], as are the values of the parameters ($\lambda = 0.15$ 
and $a_i = b_i = 1.0$ for all $i$). The demand $w_i$ is assessed by assigning individual 
students to a school in a particular district on the basis of probabilities $p_{ij}$, 
thus simulating the student’s choice of school. In order to reduce the amount of 
computation the number of students was scaled. Table 16.1 gives the resulting 
solution (the number of places that should be provided), together with the total 
number of students actually attending schools in each district. 
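
The simulation of the students' choices can be sketched as follows. This is a minimal illustration, assuming choice probabilities that decay exponentially with the cost $c_{ij}$ and normalize over the destination districts; the three-district cost matrix and student counts are hypothetical and are not the Turin data.

```python
import math
import random


def choice_probabilities(c, lam):
    """p[i][j] proportional to exp(-lam * c[i][j]), each row normalized to 1."""
    p = []
    for row in c:
        weights = [math.exp(-lam * cij) for cij in row]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def simulate_demand(students, p, rng):
    """Send every student living in district i to a district j drawn from p[i]."""
    n = len(p)
    demand = [0] * n
    for i, count in enumerate(students):
        for j in rng.choices(range(n), weights=p[i], k=count):
            demand[j] += 1
    return demand


# Hypothetical example: 3 districts, symmetric costs, lambda = 0.15 as in the text.
c = [[0.0, 4.0, 9.0],
     [4.0, 0.0, 5.0],
     [9.0, 5.0, 0.0]]
students = [30, 50, 20]
p = choice_probabilities(c, lam=0.15)
w = simulate_demand(students, p, random.Random(0))  # one observation of the demand
```

Each call to `simulate_demand` yields one realization of the random demand vector, which is all that a stochastic quasigradient step requires.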


Table 16.1 The solution of the problem of high school location in Turin, 
Italy [15],[23] 

(Rows: District; Number of students; Solution. The entries of the table are 
not legible in the source.) 


All real data was divided by a scaling factor of 100. We also have the constraint 
$\sum_{i=1}^{n} x_i = M$, where $M$ is the total number of students in the city 
divided by 100 (339 in this case). Once $w$ has been obtained it is quite easy to 
calculate a stochastic quasigradient. We can use the vector 
$\xi^s = (\xi_1^s, \xi_2^s,\dots,\xi_n^s)$ in method (16.2), where 

$$\xi_i^s = \begin{cases} -a_i & \text{if } w_i^s \ge x_i^s, \\ \phantom{-}b_i & \text{if } w_i^s < x_i^s. \end{cases}$$

Here $w_i^s$ is the demand in district $i$ (calculated by simulating the students’ 
behavior) at iteration number $s$, and $x_i^s$ is the $i$-th component of the 
solution at this iteration. The initial point was obtained by assuming that each 
student goes to school in his native district. After extensive averaging, the 
value of the objective function at this point was found to be 74.2; the optimal 
value is 55.9. 
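
The quasigradient above and one projected step can be sketched in a few lines. This is an illustration under stated assumptions, not the STO implementation: the projection handles only the budget constraint $\sum_i x_i = M$ (nonnegativity is ignored), the step size rule $5/(s+1)$ is an arbitrary choice, and the two-district Gaussian demand is hypothetical.

```python
import random


def penalty(x, w, a, b):
    """f(x, w) = sum_i max{a_i (w_i - x_i), b_i (x_i - w_i)}."""
    return sum(max(a[i] * (w[i] - x[i]), b[i] * (x[i] - w[i])) for i in range(len(x)))


def quasigradient(x, w, a, b):
    """xi_i = -a_i if w_i >= x_i, and b_i otherwise."""
    return [-a[i] if w[i] >= x[i] else b[i] for i in range(len(x))]


def project_to_budget(x, M):
    """Euclidean projection onto the hyperplane sum(x) = M."""
    shift = (sum(x) - M) / len(x)
    return [xi - shift for xi in x]


def sqg(x0, sample_demand, a, b, M, steps, step_size):
    """Projected stochastic quasigradient iterations x <- proj(x - rho_s * xi^s)."""
    x = project_to_budget(list(x0), M)
    for s in range(steps):
        w = sample_demand()
        xi = quasigradient(x, w, a, b)
        x = project_to_budget([x[i] - step_size(s) * xi[i] for i in range(len(x))], M)
    return x


# Hypothetical two-district problem: demands N(10, 2^2) and N(20, 2^2), M = 30.
rng = random.Random(0)
x_final = sqg(
    x0=[15.0, 15.0],
    sample_demand=lambda: [rng.gauss(10.0, 2.0), rng.gauss(20.0, 2.0)],
    a=[1.0, 1.0], b=[1.0, 1.0], M=30.0,
    steps=5000, step_size=lambda s: 5.0 / (s + 1),
)
```

With equal penalties and equal spreads the minimizer of this toy problem is the pair of demand means, so `x_final` should end up near (10, 20).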
We shall first present results obtained using the interactive option for chang- 
ing the step size, i.e., results obtained by giving the answer 2 to the question 
“step size regulation?” The step direction was specified as 11111, i.e., a direct 
observation of a stochastic quasigradient is available, this observation is made 
at the current point, the approximation step is fixed, there is no averaging, and 
no previous information is used. The size of the memory available for calculat- 
ing the performance measure (16.22) was set at 10. Table 16.2 reproduces the 
information displayed on the monitor during the first 30 iterations. 




Table 16.2 Information displayed during the first 30 iterations (facility 
location problem, interactive step size regulation) 

Iter.  Performance  Estimate $F^s$  Observation        Step   $x_4$   $x_{23}$ 
no.    measure      of $F(x^s)$     of $f(x^s,w^s)$    size 


-0.335 73.696 1.000 | 13.435 | 19.435 








The observations of $f(x^s,w^s)$ given in Table 16.2 do not provide any clues 
as to whether the algorithm is improving the values of the objective function 
$F(x^s)$ or not. At first sight these observations appear to oscillate randomly 
between 40 and 80. By contrast, the estimates $F^s$ of the function $F(x^s)$ display 
much more stable behavior, generally decreasing during the first 22 iterations 
from 73 to 64 and then stabilizing around the values 63-64 with some small 
oscillations. Looking at the behavior of the two selected variables, we see that 
their values show a steady increase or decrease until iteration number 8 for 
$x_4$ and iteration number 5 for $x_{23}$. In later iterations both variables exhibit 
oscillatory behavior. The value of the performance measure during the first 
4 iterations is negative, due to the instability of the initial estimates. It then 
begins to increase and reaches approximately 0.2, reflecting the regular behavior 
of the estimate $F^s$. After this it decreases in an oscillatory fashion to the range 
0.03-0.06. All of this indicates that it is time to decrease the step size. 

Table 16.3 Information displayed during iterations 31-59 (facility location 
problem, interactive step size regulation) 

Iter.  Performance  Estimate $F^s$  Observation        Step    $x_4$    $x_{23}$ 
no.    measure      of $F(x^s)$     of $f(x^s,w^s)$    size 

31      0.045       62.379          42.783             0.500   18.087   17.087 
33      0.025       62.295          62.783             0.500   18.261   16.261 
35      0.052       61.652          52.609             0.500   19.391   16.391 
37      0.063       61.565          46.957             0.500   19.348   16.348 
39      0.079       61.318          52.261             0.500   19.261   17.261 
41      0.050       61.211          68.174             0.500   19.174   16.174 
43      0.051       60.815          51.304             0.500   18.261   16.261 
45      0.070       60.452          57.913             0.500   17.304   16.304 
47      0.059       60.279          45.652             0.500   17.348   15.348 
49      0.035       60.277          64.957             0.500   18.391   15.391 
51      0.043       60.104          61.739             0.500   18.652   14.652 
53      0.017       60.133          64.696             0.500   18.565   14.565 
55      0.017       60.240          67.043             0.500   18.652   14.652 
57      0.030       60.819          65.565             0.500   18.565   15.565 
59     -0.052       61.189          85.391             0.500   18.609   16.609 








After changing the step size, the estimates of $F(x^s)$ decreased steadily 
during iterations 31-51, and then started to increase during iterations 52-59 
(see Table 16.3). The performance measure first increased, reaching a level of 
0.05-0.07 between iterations 35 and 47 before dropping back to negative values. 
It is necessary to decrease the step size once again. 


Table 16.4 Information displayed during iterations 62-80 (facility location 
problem, interactive step size regulation) 











We decided to stop after iteration number 80 (see Table 16.4) and estimate 
the value of the objective function at the current point. The average after the 
first 500 observations was 56.53, which shows that we are fairly close to the 
optimal solution. Note that this estimate is considerably lower than the value 
of $F^s$ (61.0) given in the table. This is due to the fact that the estimate $F^s$ 
is calculated from (16.18) including only one additional observation $f(x^s,w^s)$ 
per iteration, and it therefore includes observations made at early points which 
are clearly far from the optimum. Nevertheless, this estimate is still useful in 
determining the value of the step size because it reflects the general behavior 
of the algorithm. Subsequent iterations improved the value of the objective 
function only marginally (see Table 16.5). 


Table 16.5 Information displayed during iterations 90-3070 (facility location 
problem, interactive step size regulation) 

Iter.  Performance  Estimate $F^s$  Step     $x_4$ 
no.    measure      of $F(x^s)$     size 

  90    0.063       60.601          0.200    17.930 
 100    0.143       59.876          0.100    18.287 
 120    0.022       59.579          0.100    18.330 
 140    0.061       58.890          0.100    18.626 
 160   -0.011       59.161          0.100    19.226 
 180    0.319       58.761          0.020    19.379 
 200    0.008       58.608          0.020    19.237 
 300    0.317       57.847          0.020    18.946 
 400   -0.368       57.627          0.005    18.909 
 500    0.270       57.584          0.005    18.869 
 800   -0.830       57.012          0.001    18.967 
1100    3.773       57.071          0.0003   18.980 
1570    1.521       56.858          0.0001   18.983 
2070    0.916       56.629          0.0001   18.975 
2570   -0.874       56.603          0.0001   18.978 
3070    0.118       56.425          0.0001   18.982 

(The columns giving the observation of $f(x^s,w^s)$ and the variable $x_{23}$ are 
not legible in the source.) 

Our final estimate of the objective function was 56.0, which is close to the 
optimal value. 

The same results can be obtained by automatic regulation of the step size. 
In this case we give the answer 1 to the question “step size regulation?”, i.e., 
adaptive automatic step size regulation (16.24) using function estimate (16.18). 
We also set 

initial step size 
multiplier 
frequency of step size change 
lower bound on function decrease 
size of memory 

(see the description of the step size parameters in Section 16.4; of the values 
assigned to these parameters only 0.02 and 15 are legible in the source). The 
results are presented in Table 16.6. 

Table 16.6 Information displayed during iterations 2-1200 (facility location 
problem, adaptive automatic step size regulation) 

Performance  Estimate $F^s$  Observation        Step    $x_4$    $x_{23}$ 
measure      of $F(x^s)$     of $f(x^s,w^s)$    size 

 3.663       77.826          60.261             1.000   10.739   20.739 
 1.590       72.522          57.739             1.000   12.739   18.739 
 1.091       69.232          54.522             1.000   14.826   20.826 
 0.892       65.457          48.174             1.000   14.826   18.826 
 0.736       63.609          56.087             1.000   16.913   18.913 
 0.453       64.980          65.652             1.000   18.130   18.130 
 0.071       64.435          58.522             1.000   17.522   19.522 
 0.023       64.304          49.652             1.000   19.783   15.783 
 0.007       61.951          49.391             1.000   17.609   15.609 
 0.017       61.563          68.696             0.700   15.104   15.104 
 0.017       60.593          90.195             0.490   18.665   18.245 
 0.017       60.246          65.349             0.240   20.166   16.855 
 0.054       59.526          48.282             0.082   19.657   17.223 
 0.036       59.277          50.012             0.028   19.131   17.248 
-0.035       58.495          58.695             0.020   19.074   16.999 
-0.100       58.440          63.486             0.010   18.903   16.986 
 0.143       57.936          36.450             0.007   18.913   16.984 
 0.446       57.683          47.760             0.003   18.955   16.998 
-0.024       57.387          43.263             0.003   18.945   16.995 
 0.412       57.116          50.086             0.002   18.975   16.958 
 0.430       57.006          43.503             0.001   18.947   16.969 
-0.063       56.726          76.801             0.001   18.969   16.997 
 0.165       56.623          65.457             0.001   18.989   16.994 

(The iteration numbers of the rows are not legible in the source.) 




The value of the objective function at the final point (average of 4000 
observations) is 56.2, which is close to the optimal value. The behavior of the 
algorithm was virtually the same as in the interactive case: quite a reasonable 
approximation of the optimal solution was obtained after 100-150 iterations, 
with little improvement being observed thereafter. 

16.5.2 Control of water resources 


This example is taken from work by A. Prékopa and T. Szantai. An extended 
description of the problem together with a solution obtained by reduction to a 
special type of nonlinear programming problem is given in [28]. Here we shall 
show how the problem can be solved using stochastic quasigradient methods. 
The basic aim is to control the level of water in Lake Balaton (a large, shallow 
lake in western Hungary). A certain volume of water $w_i$ flows into the lake 
from rivers, rainfall, etc., in time period $i$. This inflow varies randomly from 
one period to another, but it is possible to derive its probabilistic distribution 
from previous observations. The control parameter is the amount $x_i$ of water 
released from the lake into the River Danube in each time period; the objective 
is to maximize the probability of the water level lying within specified bounds. 
It turns out that a reasonable control policy can be determined by considering 
only two consecutive periods of time, which in this example are measured in 
months. After appropriate transformations we arrive at the following problem 
(for details see [28]): 

$$\max_{x_1,x_2} P\{Z(x_1,x_2)\}, \qquad 0 \le x_1 \le R, \quad 0 \le x_2 \le R,$$

where the set $Z(x_1,x_2)$ is defined as follows: 

$$Z(x_1,x_2) = \{(w_1,w_2) : a_1 \le w_1 - x_1 \le b_1,\ a_2 \le w_2 - x_1 - x_2 \le b_2\}.$$

Here $a_i$, $b_i$ are respectively the lower and upper bounds on the “generalized” 
water level: in this particular example we took $a_1 = a_2 = -205$, $b_1 = b_2 = 95$, 
$R = 200$. The random water inputs $w_1$ and $w_2$ have a joint normal distribution 
$H(w_1,w_2)$ with expectations $E(w_1) = -28.07$, $E(w_2) = -59.43$ and covariance 
matrix 

$$\begin{pmatrix} \ast & 4660.51 \\ 4660.51 & 10121.36 \end{pmatrix}$$

(the entry marked $\ast$ is not legible in the source; the upper right entry 
follows from symmetry). 

Let $\chi(x_1,x_2,w_1,w_2)$ denote the indicator function of the set $Z(x_1,x_2)$, i.e., 

$$\chi(x_1,x_2,w_1,w_2) = \begin{cases} 1 & \text{if } (w_1,w_2) \in Z(x_1,x_2), \\ 0 & \text{otherwise.} \end{cases}$$

The problem then becomes 

$$\max_{x \in X} \int \chi(x_1,x_2,w_1,w_2)\, dH(w_1,w_2)$$

and can be solved using stochastic quasigradient methods. We took (95,95) as 
the initial point; the value of the objective function at this point was 0.32. 
According to [28], the optimal solution is (2,0), with an objective function value 
of 0.857. We decided to solve the problem using a finite-difference approximation 
of a stochastic quasigradient. Below we demonstrate how our interactive 
software package STO may be used to solve this problem, specifying interactive 
step size regulation (option 2) and step direction 21124 (i.e., taking a central 
finite-difference approximation of the gradient, calculating the step direction at 
the current point, with a fixed approximation step, a number of samples greater 
than 1, no previous information, and such that the step direction vector has 
unit norm). 
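
A normalized central finite-difference direction of this kind can be sketched as follows. This is an illustration, not the STO code. Three assumptions are made: the same inflow samples are used at both shifted points, the second constraint is taken in the form $a_2 \le w_2 - x_1 - x_2 \le b_2$, and the variance of $w_1$ (`VAR_W1_ASSUMED`) is a hypothetical placeholder, since that entry of the covariance matrix is not available here.

```python
import math
import random

MEAN = (-28.07, -59.43)
VAR_W1_ASSUMED = 4000.0  # hypothetical value, NOT taken from the source
COV = ((VAR_W1_ASSUMED, 4660.51), (4660.51, 10121.36))


def sample_inflows(mean, cov, rng):
    """Draw (w1, w2) from a bivariate normal via the Cholesky factor of cov."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mean[0] + l11 * g1, mean[1] + l21 * g1 + l22 * g2


def indicator(x, w, a=(-205.0, -205.0), b=(95.0, 95.0)):
    """1.0 if (w1, w2) lies in Z(x1, x2), else 0.0."""
    ok1 = a[0] <= w[0] - x[0] <= b[0]
    ok2 = a[1] <= w[1] - x[0] - x[1] <= b[1]
    return 1.0 if ok1 and ok2 else 0.0


def direction(x, samples, h):
    """Central finite-difference gradient estimate of P{Z(x)}, scaled to unit norm."""
    g = []
    for i in range(2):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        diff = sum(indicator(xp, w) - indicator(xm, w) for w in samples)
        g.append(diff / (2.0 * h * len(samples)))
    norm = math.hypot(g[0], g[1])
    return [gi / norm for gi in g] if norm > 0 else [0.0, 0.0]


def ascent_step(x, d, rho, R=200.0):
    """Move along d (maximization) and clip to the box [0, R]^2."""
    return [min(max(x[i] + rho * d[i], 0.0), R) for i in range(2)]


rng = random.Random(0)
samples = [sample_inflows(MEAN, COV, rng) for _ in range(2000)]
prob_at_start = sum(indicator((95.0, 95.0), w) for w in samples) / len(samples)
d = direction((95.0, 95.0), samples, h=10.0)
x_next = ascent_step([95.0, 95.0], d, rho=10.0)
```

Because the direction has unit norm, a step size of 10.0 always moves the point by exactly 10 units (before clipping), which is why the step size has to be reduced by hand or automatically near the optimum.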


The parameters were set at the following values: 

step in finite difference approximation   10.0 
number of samples 
value of step size                         10.0 
size of memory 

(the values of the number of samples and the size of memory are not legible in 
the source). The results are given in Table 16.7. 

Table 16.7 Information displayed during iterations 1-110 (water management 
problem, interactive step size regulation) 

Iter.  Performance  Estimate $F^s$  Observation        Step     $x_1$     $x_2$ 
no.    measure      of $F(x^s)$     of $f(x^s,w^s)$    size 

  1    0.           0.              0.                 10.000   102.071   102.071 
  2    1.000        0.              0.                 10.000   102.071   102.071 
  4    0.025        0.250           1.000              10.000   106.543    93.127 
  6    0.011        0.333           0.                 10.000   113.614   110.198 
  8    0.007        0.375           0.                 10.000   106.543   113.127 
 10    0.006        0.400           0.                 10.000   106.543    93.127 
 15    0.003        0.333           0.                 10.000    83.944   101.254 
 20    0.002        0.350           0.                 10.000    68.397    90.630 
 30    0.001        0.467           0.                 10.000    18.240    93.229 
 40    0.000        0.475           1.000              10.000    48.678    63.727 
 50    0.000        0.500           1.000              10.000    41.277    29.097 
 60    0.000        0.567           1.000              10.000     0.       43.004 
 70    0.000        0.571           1.000              10.000     1.056    30.405 
 80    0.000        0.588           1.000              10.000     1.386    14.142 
 90    0.000        0.600           1.000              10.000     0.       24.142 
100    0.000        0.610           1.000              10.000     7.071    20.000 
110    0.000        0.609           1.000              10.000    10.000     0. 





After iteration 110 we stopped and estimated the value of the function at 
the current point on the basis of 4000 observations—we obtained a value of 
0.843, which is close to the optimal value. Subsequent iterations improved the 
value of the objective function only marginally (see Table 16.8). 

Table 16.8 Information displayed during iterations 120-8090 (water management 
problem, interactive step size regulation) 

Performance  Estimate $F^s$ 
measure      of $F(x^s)$ 





102.071 | 102.071 
0.106 1.707 
2.707 6.309 
3.071 7.835 
1.787 8.110 
3.463 6.392 
0.383 5.538 
0.161 4.895 
0.071 5.049 
0.064 4.955 
0.106 4.980 
0.016 4.970 
0.020 4.985 








After iteration 200 we changed the step in the finite-difference approximation 
to 1.0. The value of the objective function at the final point was 0.85, i.e., 
we had reached the optimal value. However, the values of the controls were far 
from the solution due to the flatness of the function around the optimum. 


16.5.3 Determining the parameters in a closed loop control law for 
stochastic dynamical systems with delay 

We have so far considered only static optimization problems. However, all of the 
techniques described above can also be applied to many classes of dynamical 
stochastic optimization problems. The example that we shall consider was 
suggested by A. Wierzbicki and is the problem of finding the optimal control 
parameters in a closed loop control law for a linear dynamical system disturbed 
by random noise. The state equations include response delay and may be 
written as follows: 

$$z_{t+1} = a z_t + u_{t-k} + w_t, \qquad t = 0,\dots,T, \tag{16.28}$$

$$z_0 = 1, \qquad u_{-i} = 0, \quad i = 0,\dots,k,$$

where $t$ is a discrete time, $z_t$ is the state of the dynamical system at time $t$, 
$u_t$ is the value of the control at time $t$, and $w_t$ is the random noise at time 
$t$. In this particular example the $w_t$ were taken to be distributed uniformly 
over the interval $[-\delta,\delta]$ and such that $w_t$ and $w_\tau$ are uncorrelated 
for $t \ne \tau$. However, neither this particular type of distribution nor these 
correlation properties are prerequisites for the use of the methods described in 
the preceding sections. The controls $u_t$ were chosen according to the following 
closed loop control law: 

$$u_t = x_1 \Bigl(-z_t - x_2 \sum_{\tau=0}^{t} z_\tau\Bigr), \tag{16.29}$$

where the decision parameters are $x_1 \ge 0$ and $x_2 \ge 0$. 

The objective is to minimize the deviation of the state of the system from 
zero. We may therefore state the problem as follows: minimize the objective 
function 

$$F(x_1,x_2) = E_w \sum_{t=1}^{T} z_t^2 \tag{16.30}$$

with respect to the control law parameters $x_1$ and $x_2$, subject to constraints 
(16.28) and (16.29) and nonnegativity constraints on $x_1$, $x_2$. We solved the 
problem with the following parameter values: time horizon $T = 100$, delay 
$k = 5$, state equation coefficient $a = 0.9$, bounds for random noise $\delta = 0.1$. 
With these values the optimal control parameters are $x_1 = 0.1$, $x_2 = 0$; the 
value of the objective function obtained after 10,000 observations was 4.52. It 
was discovered during preliminary runs that for $x_1 > 0.3$, $x_2 > 0.1$ the system 
becomes unstable and therefore these values were taken as upper bounds for 
the variables. 
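
One observation of the random objective can be obtained by simulating the system once. The sketch below is only an illustration and rests on an assumed reading of the model: state equation $z_{t+1} = a z_t + u_{t-k} + w_t$ with $z_0 = 1$, controls equal to zero before time 0, the control law $u_t = x_1(-z_t - x_2 \sum_{\tau \le t} z_\tau)$, and noise uniform on $[-\delta,\delta]$. The function names are hypothetical.

```python
import random


def simulate_cost(x1, x2, T=100, k=5, a=0.9, delta=0.1, rng=None):
    """Sum of squared states z_1..z_T for one realization of the noise.

    Assumed model (a reconstruction, see the lead-in):
        z_{t+1} = a*z_t + u_{t-k} + w_t,   z_0 = 1,   u_t = 0 for t < 0,
        u_t = x1 * (-z_t - x2 * sum_{tau <= t} z_tau),   w_t ~ U[-delta, delta].
    """
    rng = rng or random.Random()
    z = 1.0
    running_sum = 0.0      # sum of z_0..z_t
    controls = []          # controls[t] = u_t for t >= 0
    cost = 0.0
    for t in range(T):
        running_sum += z
        controls.append(x1 * (-z - x2 * running_sum))
        delayed = controls[t - k] if t >= k else 0.0   # u_{t-k}
        w = rng.uniform(-delta, delta)
        z = a * z + delayed + w                        # this is z_{t+1}
        cost += z * z
    return cost


def estimate_objective(x1, x2, n_runs, rng):
    """Monte Carlo average of the cost over n_runs independent noise paths."""
    return sum(simulate_cost(x1, x2, rng=rng) for _ in range(n_runs)) / n_runs


F_hat = estimate_objective(0.1, 0.0, n_runs=200, rng=random.Random(0))
```

A finite-difference quasigradient is then obtained by evaluating `simulate_cost` at shifted parameter values, preferably with the same noise path at each shifted point.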

We set the initial point equal to the upper bounds $x_1^0 = 0.3$, $x_2^0 = 0.1$; the 
value of the objective function at this point (based on 3000 observations) was 
422.56. We chose automatic step size regulation (option 1), i.e., the step size 
changes are based upon performance function (16.22). The step direction was 
specified as 71114, i.e., taking a forward finite-difference approximation of the 
gradient of the random objective function $f(x,w)$ with all observations of the 
function needed for one gradient evaluation made at the same value of the noise; 
with a fixed finite difference step and the finite-difference evaluation performed 
at the current point; without averaging; using no previous information and 
normalizing the resulting step direction. The parameters of the algorithm were 
as follows: 

step in finite difference approximation                      0.0001 
initial step size                                            0.1 
multiplier (for diminishing the step size)                   0.85 
frequency of step size change (actually the frequency 
  with which the step size is reviewed)                      15 
lower bound on function decrease (the lowest value of 
  performance function (16.22) which does not lead to 
  a decrease in the step size)                               0.09 
size of memory (for evaluating (16.22))                      15 
least value of step size (stopping criterion)                0.000001 


The results of the calculations are given in Table 16.9. 

Table 16.9 Information displayed during iterations 1-120 (control law 
problem, automatic step size regulation) 

Table 16.10 Information displayed during iterations 150-1500 (control law 
problem, automatic step size regulation) 

We stopped after iteration 120 to estimate the value of the objective function, 
which was calculated to be 4.54 after 3000 observations and is fairly close to 
the optimal value. Subsequent iterations improved the solution only marginally 
(see Table 16.10). 

This example once again demonstrates the characteristic behavior of 
stochastic optimization algorithms: the neighborhood of the optimal solution is 
reached reasonably rapidly; oscillations then occur in this neighborhood and 
the current approximation to the optimal solution improves slowly. 

The nature of stochastic quasigradient algorithms allows easy extension of 
model (16.28)-(16.30) to multivariable and nonlinear systems. 

CHAPTER 17 


STEPSIZE RULES, STOPPING TIMES AND THEIR 
IMPLEMENTATION IN STOCHASTIC QUASIGRADIENT 
ALGORITHMS 


G.Ch. Pflug 


17.1 Introduction 

We consider the constrained optimization problem 

$$f(x) = E_P(q(x,\cdot)) = \min!, \qquad x \in S, \tag{17.1}$$

where $S \subset \mathbb{R}^k$ is a closed, convex set of constraints. The symbol $E_P$, or 
briefly $E$, denotes the expectation with respect to the probability measure $P$, 
which is defined on some measurable space $(\Omega, \mathcal{A})$: 

$$E_P(q(x,\xi)) = \int q(x,\xi)\, dP(\xi). \tag{17.2}$$

There are, in principle, two different ways of attacking the problem (17.1): 


(a) Reduction to deterministic optimization 

The easiest situation arises if the integral (17.2) may be calculated analytically. 
In that case the problem (17.1) reduces to a deterministic constrained 
optimization problem. But even if there is no closed-form analytical representation 
of (17.2) the integral may be approximated with arbitrary accuracy. This may be 
done by approximating the probability measure $P$ by a sequence $P_n$ such that 
$P_n \to P$ (in an appropriate sense) to guarantee that 

$$\int q(x,\xi)\, dP_n(\xi) \to \int q(x,\xi)\, dP(\xi)$$

and the first integrals are easy to calculate. Very often discrete measures are 
used for $P_n$. Another possibility is to calculate $E_P(q(x,\xi))$ directly by Monte 
Carlo or quasi Monte Carlo methods. 
(b) Stochastic quasigradient method 

For this group of methods it is not necessary to get good approximations of 
$E_P(q(x,\cdot))$; stochastic estimates suffice. If $\xi$ is a random variable (random 
number or random vector) with distribution $P$ then 

$$Q_x = q(x,\xi)$$

is a random variable with expectation $E(Q_x) = f(x)$. A statistical approximation 
of the gradient $\nabla f(x)$ of the objective function $f$ may be obtained by 
considering a difference approximation 

$$(Y_x)_i = \frac{1}{2h}\bigl[q(x + h e_i, \xi) - q(x - h e_i, \xi)\bigr],$$

where $(Y)_i$ denotes the $i$-th component of the vector $Y$ and $e_i$ is the $i$-th unit 
vector. Then, if $f(x)$ is twice differentiable, 

$$E_P(Y_x) = \nabla f(x) + O(h^2).$$

Such a random vector $Y_x$ is called a stochastic quasigradient, giving the method 
its name. Only the stochastic quasigradient (SQG) approach will be considered 
in this paper. 

Sometimes there are even unbiased estimates of $\nabla f(x)$ available. This is, 
e.g., the case if 

$$x \mapsto q(x,\xi)$$

is differentiable in the $L^1(P)$-sense. This means that there is a vector of 
$L^1$-functions $\nabla q(x,\xi)$ such that 

$$\int \bigl|q(x_1,\xi) - q(x_2,\xi) - (x_1 - x_2)' \nabla q(x_2,\xi)\bigr|\, dP(\xi) = o(\|x_1 - x_2\|). \tag{17.3}$$

In that case evidently 

$$E(\nabla q(x,\xi)) = \nabla f(x).$$

It is important to notice that the following chain of implications holds: 

[$q(x,\xi)$ differentiable for every $\xi$ and $L^1$-dominated] $\Longrightarrow$ 
[$q(x,\cdot)$ differentiable in the $L^1$-sense] $\Longrightarrow$ 
[$f(x)$ differentiable] 

The converse implications are not true as can be seen from the following 
examples. 


Example (a). 

Let 

$$q(x,\xi) = \begin{cases} a(x - \xi) & \text{if } x \ge \xi, \\ b(\xi - x) & \text{if } x < \xi. \end{cases}$$

Such a specification is often encountered in economic applications where $a$ 
denotes the surplus costs and $b$ the shortage costs of a random demand $\xi$, $x$ being 
the offer. $x \mapsto q(x,\xi)$ is not a differentiable function. However, if $\xi$ is 
integrable then 

$$\nabla q(x,\xi) = \begin{cases} \phantom{-}a & \text{if } x \ge \xi, \\ -b & \text{if } x < \xi \end{cases}$$

is the $L^1$-derivative of $q(\cdot,\cdot)$ as one can see immediately. 

Example (b). 

Let 

$$q(x,\xi) = \begin{cases} 0 & \text{if } x + \xi \in A, \\ 1 & \text{if } x + \xi \notin A, \end{cases}$$

where $A$ is some predetermined region in $\mathbb{R}^k$. Such problems arise in optimal 
control, if the probability that the control $x$ plus noise $\xi$ lies in the set $A$ 
should be maximized. This function $q(x,\xi)$ is not $L^1$-differentiable although 
the function 

$$E_P(q(x,\cdot))$$

is differentiable if $P$ has a density w.r.t. Lebesgue measure. 

A similar notion holds for subdifferentiable functions: an $\mathbb{R}^k$-valued random 
variable $Y_x$ is called a stochastic subgradient if 

$$E_P(Y_x) \in \partial f(x).$$

This is again a weaker statement than the pointwise subdifferentiability of 
$x \mapsto q(x,\xi)$. 

As regards the smoothness properties of our problem (17.1) we may 
distinguish two cases: 
(a) the functions $q(x,\cdot)$ are $L^1$-(sub)differentiable; 
(b) the functions $q(x,\cdot)$ are not $L^1$-(sub)differentiable but the expectations 
$f(x)$ are. 


The stochastic quasigradient method uses a recursively defined stochastic 
sequence $X_n$ to approximate the solution of (17.1): 

$$X_{n+1} = \Pi_S(X_n - \rho_n Y_{X_n}), \tag{17.4}$$

where in case (a) 

$$Y_x = \nabla q(x,\xi_n) \tag{17.4a}$$

and in case (b) 

$$Y_x = \frac{q(x + h_n, \xi_n) - q(x - h_n, \xi_n)}{2h_n}. \tag{17.4b}$$

Here $\{\xi_n\}$ is a sequence of i.i.d. random variables with distribution $P$ and 
$\Pi_S$ denotes the projection onto the closed convex set $S$. The nonnegative 
constants $\rho_n$ represent the stepsizes. 
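
As an illustration, recursion (17.4) with the unbiased estimate (17.4a) can be run on the cost function of Example (a). The sketch below makes several choices that are not in the text: a scalar decision, $\xi$ uniform on $[0,1]$, $S = [0,1]$, $a = 1$, $b = 3$, and stepsizes $\rho_n = 1/n$. For this toy problem the minimizer is the $b/(a+b)$ quantile of $\xi$, i.e. 0.75. The function also returns the $\rho$-weighted mean of the iterates.

```python
import random


def grad_q(x, xi, a, b):
    """L1-derivative of q(x, xi) from Example (a): a if x >= xi, else -b."""
    return a if x >= xi else -b


def project(x, lo, hi):
    """Projection onto the interval S = [lo, hi]."""
    return min(max(x, lo), hi)


def sqg(x0, n_steps, a, b, lo, hi, rng):
    """Run X_{n+1} = Pi_S(X_n - rho_n * Y_n) with rho_n = 1/n.

    Returns the last iterate and the rho-weighted mean of the iterates.
    """
    x = x0
    weighted_sum, weight_total = 0.0, 0.0
    for n in range(1, n_steps + 1):
        rho = 1.0 / n
        weighted_sum += rho * x
        weight_total += rho
        xi = rng.random()                      # xi ~ U(0, 1), illustrative choice
        x = project(x - rho * grad_q(x, xi, a, b), lo, hi)
    return x, weighted_sum / weight_total


x_last, x_mean = sqg(0.5, 20000, a=1.0, b=3.0, lo=0.0, hi=1.0, rng=random.Random(0))
```

With $\rho_n = 1/n$ the weights of the mean decay slowly, so early iterates keep a noticeable influence on `x_mean`; the last iterate is typically the more accurate of the two in this example.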

The use of algorithms of the form (17.4a) goes back to a pioneering paper 
by Robbins and Monro [17]. Kiefer and Wolfowitz studied for the first time 
stochastic minimization problems with difference approximations for the 
gradients. It is however important to notice that the iterative process (17.4b) is 
not a Kiefer-Wolfowitz process. This is so because we have used the same 
random element $\xi_n$ two times in the definition of $Y_x$. Consequently, under some 
mild assumptions the variance of $Y_x$ satisfies $\mathrm{Var}(Y_x) = O(h_n^{-1})$. The original 
approach of Kiefer and Wolfowitz uses the quantities 

$$Y_x = \frac{q(x + h_n, \xi_n^{(1)}) - q(x - h_n, \xi_n^{(2)})}{2h_n} \tag{17.4c}$$

with $\xi_n^{(1)}$ independent of $\xi_n^{(2)}$. In that case $\mathrm{Var}(Y_x) = O(h_n^{-2})$ only. If the 
randomness comes from generated random numbers then the use of (17.4b) 
guarantees a certain variance reduction, which is impossible in the case when the 
randomness stems from measurement errors, which was the situation considered 
by Kiefer and Wolfowitz. 
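
The effect of reusing the random element can be checked numerically. The sketch below compares the sample variances of the two difference quotients for the cost function of Example (a); the choices $\xi \sim U(0,1)$, $a = 1$, $b = 3$, $x = 0.5$ and $h = 0.01$ are illustrative assumptions. Since this particular $q$ is Lipschitz in $x$, the common-sample quotient is bounded here, which is an even more favorable situation than the general $O(h_n^{-1})$ bound.

```python
import random
import statistics


def q(x, xi, a=1.0, b=3.0):
    """Cost function of Example (a)."""
    return a * (x - xi) if x >= xi else b * (xi - x)


def y_common(x, h, rng):
    """Difference quotient (17.4b): the same xi at both points."""
    xi = rng.random()
    return (q(x + h, xi) - q(x - h, xi)) / (2 * h)


def y_independent(x, h, rng):
    """Difference quotient (17.4c): independent xi at the two points."""
    xi1, xi2 = rng.random(), rng.random()
    return (q(x + h, xi1) - q(x - h, xi2)) / (2 * h)


rng = random.Random(1)
x, h, n = 0.5, 0.01, 20000
var_common = statistics.pvariance([y_common(x, h, rng) for _ in range(n)])
var_indep = statistics.pvariance([y_independent(x, h, rng) for _ in range(n)])
```

In this setting `var_common` stays of order one while `var_indep` grows like $h^{-2}$ as $h$ is reduced.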

In the following we shall restrict ourselves to the case (17.4a). 
The reason for doing so is that the behavior of (17.4b) depends much on the 
smoothness properties of $q(\cdot,\cdot)$. If this function is $L^2$-differentiable, then the 
variance of $Y_x$ is bounded as in the case of (17.4a). Otherwise the variance of $Y_x$ 
given by (17.4b) may increase with decreasing $h_n$. But although the convergence 
theorems for the case (b) procedure require that $h_n \to 0$ with increasing $n$, it is 
reasonable to keep $h_n$ away from zero in practical implementations. Otherwise 
numerical difficulties are encountered in forming the quotient in (17.4b). Thus 
for practical applications we may assume that the variance of $Y_x$ is bounded 
anyway. 

The convergence properties of the iterative sequence $X_n$ given by (17.4) 
were studied by many authors. These properties include almost sure 
convergence of $X_n$ to $x^*$, the solution of (17.1) (Dvoretzky [8], Kushner and 
Clark [18], Ermoliev [4], Hiriart-Urruty [9] among others), rates of convergence 
(Schmetterer [20]), asymptotic laws (Blum [1], Fabian [7]), laws of iterated 
logarithms, etc. 

We shall give here a simple but illustrative a.s. convergence result for 
random step sizes. Randomness does not mean here that a random line search 
is made but that the stepsize $\rho_n$ may depend on the information obtained up to 
the $n$-th step (i.e. an adaptive stepsize rule). Denote by $x^*$ the solution of the 
problem (17.1), which is assumed to be unique. The $\sigma$-algebra generated by 
$\xi_1, \xi_2, \dots, \xi_{n-1}$ is denoted by $\mathcal{F}_n$. Moreover we shall assume that 

(i) $\langle \nabla f(x), x - x^* \rangle \ge a\|x - x^*\|^2$ 
(ii) $\|\nabla f(x)\|^2 \le A + B\|x - x^*\|^2$ 
(iii) $\mathrm{Var}(Y_x) \le C$ 
(iv) $\rho_n \ge 0$, $\rho_n$ is $\mathcal{F}_n$-measurable 

Theorem. 
(i) Under the above conditions 

$$\sum_n \rho_n = \infty \ \text{a.s.}, \qquad \sum_n \rho_n^2 < \infty \ \text{a.s.}$$

implies that $X_n \to x^*$ a.s. 
(ii) If $f(x)$ is convex and $S$ is bounded then $\rho_n \to 0$ a.s. and $\sum_n \rho_n = \infty$ 
implies that $\bar{X}_n \to x^*$ a.s., where $\bar{X}_n$ is the weighted mean of the process 

$$\bar{X}_n = \frac{\sum_{i=0}^{n} \rho_i X_i}{\sum_{i=0}^{n} \rho_i}$$

(Mirozahmedov and Uryasev [14]). 
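Outline of the proof. According to the assumptions

$$\begin{aligned}
E(\|X_{n+1} - x^*\|^2 \mid \mathcal{F}_n) &\le \|X_n - x^*\|^2 - 2\rho_n \langle \nabla f(X_n), X_n - x^* \rangle \\
&\quad + \rho_n^2 (A + B\|X_n - x^*\|^2) + \rho_n^2 C \\
&= \|X_n - x^*\|^2 (1 + \beta_n) - q_n + \mu_n \quad \text{(say)}
\end{aligned}$$

with $\beta_n = B\rho_n^2$, $q_n = 2\rho_n \langle \nabla f(X_n), X_n - x^* \rangle$, $\mu_n = \rho_n^2 (A + C)$. Here
$\beta_n, q_n, \mu_n \ge 0$ and the series $\sum \beta_n$ and $\sum \mu_n$ converge a.s. By a theorem
due to Robbins and Siegmund [18] this implies that

$$\|X_n - x^*\|^2 \ \text{converges a.s.}$$

and

$$\sum_n \rho_n \langle \nabla f(X_n), X_n - x^* \rangle < \infty \quad \text{a.s.} \tag{17.5}$$

Part (i) of the theorem follows, since by assumption (i) relation (17.5)
implies that

$$\sum_n \rho_n \|X_n - x^*\|^2 < \infty \quad \text{a.s.,}$$

which together with $\sum \rho_n = \infty$ and the a.s. convergence of $\|X_n - x^*\|^2$ gives

$$\|X_n - x^*\|^2 \to 0 \quad \text{a.s.}$$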
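In order to prove part (ii) of the theorem we introduce the notation
$Y_n := Y_{X_n}$ and $Z_n := Y_n - E(Y_n)$.
By iterating the recursion we find that

$$\begin{aligned}
0 \le \|X_{n+1} - x^*\|^2 &\le \|X_0 - x^*\|^2 - 2\sum_{i=0}^{n} \rho_i \langle \nabla f(X_i), X_i - x^* \rangle \\
&\quad - 2\sum_{i=0}^{n} \rho_i \langle Z_i, X_i - x^* \rangle + \sum_{i=0}^{n} \rho_i^2 \|Y_i\|^2 .
\end{aligned}$$

Because of the convexity

$$\sum_{i=0}^{n} \rho_i \langle \nabla f(X_i), X_i - x^* \rangle \ge \sum_{i=0}^{n} \rho_i \,(f(X_i) - f(x^*)) \ge (f(\bar X_n) - f(x^*)) \sum_{i=0}^{n} \rho_i .$$

Hence

$$0 \le f(\bar X_n) - f(x^*) \le \Big(\sum_{i=0}^{n} \rho_i\Big)^{-1} \Big[ \tfrac12 \|X_0 - x^*\|^2 - \sum_{i=0}^{n} \rho_i \langle Z_i, X_i - x^* \rangle + \tfrac12 \sum_{i=0}^{n} \rho_i^2 \|Y_i\|^2 \Big]. \tag{17.6}$$

It can easily be deduced from the assumptions that the right hand side of (17.6)
converges to zero a.s., implying thus

$$\bar X_n \to x^* \quad \text{a.s.} \qquad \Box$$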


The theorem just proven gives conditions for the stepsizes $\rho_n$ which are so
general that they cover a variety of cases. On the other hand, it does not tell us
which stepsize rule is good, or even the best. All choices fulfilling $\sum \rho_n^2 < \infty$,
$\sum \rho_n = \infty$ lead to a.s. convergence. A detailed study of such rules follows in
the next section.
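
The following Python sketch illustrates the theorem on a toy problem. It is
only an illustration under stated assumptions: the objective
$f(x) = \frac12\|x - x^*\|^2$, the Gaussian gradient noise, the choice
$\rho_n = \rho/n$ and all names in the code are hypothetical and not taken
from the text. It runs the unconstrained recursion $X_{n+1} = X_n - \rho_n Y_n$
and also accumulates the weighted mean of part (ii).

```python
import random


def stochastic_approximation(x0, rho, n_steps, seed=0):
    """Run X_{n+1} = X_n - rho_n * Y_n with rho_n = rho / n.

    Toy problem (an assumption of this sketch): f(x) = 0.5 * ||x - x_star||^2,
    so the noisy gradient is Y_n = (X_n - x_star) + Gaussian noise.
    """
    rng = random.Random(seed)
    x_star = [1.0, -2.0]          # assumed solution of the toy problem
    x = list(x0)
    weighted_sum = [0.0, 0.0]     # running sum of rho_i * X_i
    weight_total = 0.0            # running sum of rho_i
    for n in range(1, n_steps + 1):
        rho_n = rho / n           # sum rho_n = inf and sum rho_n^2 < inf
        weighted_sum = [w + rho_n * xi for w, xi in zip(weighted_sum, x)]
        weight_total += rho_n
        y = [xi - si + rng.gauss(0.0, 1.0) for xi, si in zip(x, x_star)]
        x = [xi - rho_n * yi for xi, yi in zip(x, y)]
    x_bar = [w / weight_total for w in weighted_sum]
    return x, x_bar, x_star


def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


x_start = [5.0, 5.0]
x_last, x_mean, x_star = stochastic_approximation(x_start, rho=1.0, n_steps=20000)
```

In this example the last iterate ends up close to the solution, while the
weighted mean improves only slowly, because with weights $\rho/n$ the starting
value is diluted at a logarithmic rate.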


17.2 Stepsize rules and stopping times 


Almost sure convergence results are only of limited importance for practical
purposes. It is much more important to design a procedure which stops after
a finite number of steps within a neighborhood of the solution $x^*$ which has
predetermined size. To put it more formally, let $\|\cdot\|_D$ be a certain norm in $\mathbb{R}^k$
and $\alpha$ resp. $\varepsilon$ two constants representing the desired confidence level and the
size of a confidence region. An approximation procedure $X_n$ is of practical use
only in connection with a stopping time $\tau = \tau(\alpha, \varepsilon)$ such that $X_\tau$, the process
stopped at $\tau$, satisfies

$$P_{x_0}\{\|X_\tau - x^*\|_D \le \varepsilon\} \ge 1 - \alpha \quad \forall x_0 \tag{17.7}$$

where $P_{x_0}$ denotes the law of the process $\{X_n\}$ started at $X_0 = x_0$. Formula
(17.7) is nothing else than the definition of a fixed width confidence region of
level $\alpha$.

Unfortunately exact level $\alpha$ confidence regions are difficult to obtain, even
in the much simpler case of the sequential estimation of a mean value (see e.g.
Chow and Robbins [3]). It is much easier to get asymptotic level $\alpha$ confidence
regions. This is a family of stopping times $\{\tau_\varepsilon\}$ such that

$$\lim_{\varepsilon \to 0} P_{x_0}\{\|X_{\tau_\varepsilon} - x^*\|_D \le \varepsilon\} \ge 1 - \alpha \quad \forall x_0 \tag{17.8}$$

It may happen that the speed of convergence depends heavily on the starting
value $x_0$. In that case the actual significance level of a confidence region may
be arbitrarily low if the starting value was poorly chosen.

It is therefore useful to consider a uniform version of (17.8). In particular
we call $\{\tau_\varepsilon\}$ a family of uniform asymptotic level $\alpha$ confidence regions if

$$\lim_{\varepsilon \to 0} \inf_{x_0} P_{x_0}\{\|X_{\tau_\varepsilon} - x^*\|_D \le \varepsilon\} \ge 1 - \alpha \tag{17.9}$$

A stopping time fulfilling (17.9) is considered to be robust against bad influences
of the starting value. In order to get error estimates some knowledge about the
speed of convergence in (17.9) is of great help but usually difficult to obtain.
It is important to stress that stopping times must be seen in connection
with stepsize rules. Typically a certain rule for determining the stepsizes leads
to a certain asymptotic behavior of the process $X_n$, which in turn is the basis
for the definition of a stopping rule $\tau$. On the other hand one may also define
stepsizes on the basis of a sequence of increasing stopping times by changing
the stepsizes (say by multiplication with 1/2) exactly at these times. Thus the
interrelations between stepsizes and stopping times are rather close.
We shall now define some common stepsize rules and the pertaining
stopping times. Recall that

(i) $\rho_n$ is $\mathcal{F}_n$-measurable
(ii) $\sum_n \rho_n = \infty$ a.s. (17.10)
(iii) $\sum_n \rho_n^2 < \infty$ a.s.

are the minimal conditions to guarantee convergence.
(a) Deterministic stepsize rules (DSR) 


The simplest rule consists in taking $\{\rho_n\}$ as a sequence of nonrandom constants
fulfilling (17.10), e.g.

$$\rho_n = \frac{1}{n^{\beta}}, \quad \frac12 < \beta \le 1, \qquad \text{or} \qquad \rho_n = \frac{\log n}{n}.$$


The quickest rate of convergence is achieved by taking

$$\rho_n = \frac{\rho}{n} \tag{17.11}$$

which is by far the most popular choice.
Many asymptotic results are known if the stepsizes are chosen according
to (17.11): If the solution $x^*$ is an interior point of $S$ then

$$\sqrt{n}\,(X_n - x^*) \to N(0, \Sigma) \tag{17.12}$$

where $\Sigma$ is the solution of the matrix equation

$$\Big(\rho A - \frac{I}{2}\Big)\Sigma + \Sigma\Big(\rho A' - \frac{I}{2}\Big) = \rho^2 C \tag{17.13}$$

with $C$ being the covariance matrix of $g(x^*, \xi)$

$$C = \mathrm{Cov}(g(x^*, \xi)) \tag{17.14}$$

and $A$ is the Hessian of $f$ at $x^*$

$$\nabla f(x) = A(x - x^*) + O(\|x - x^*\|^2). \tag{17.15}$$

Here $\to$ denotes convergence in distribution. Equation (17.13) may be made
explicit for $\Sigma$ either by writing

$$\Sigma = \rho^2 \int_0^\infty \exp\Big(-u\Big(\rho A - \frac{I}{2}\Big)\Big)\, C \,\exp\Big(-u\Big(\rho A' - \frac{I}{2}\Big)\Big)\, du$$

or by introducing the vec operation (which transforms a matrix into a vector
by putting the columns one above the other)

$$\mathrm{vec}\,\Sigma = \rho^2 \Big(I \otimes \Big(\rho A - \frac{I}{2}\Big) + \Big(\rho A' - \frac{I}{2}\Big) \otimes I\Big)^{-1} \mathrm{vec}\, C.$$

It is important to notice that the asymptotic distribution (17.12) is independent
of the starting value $x_0$. There is even a much stronger result known. Consider
the random function

$$Z_n(t) = \frac{[nt]}{\sqrt{n}}\,(X_{[nt]} - x^*), \qquad 0 \le t \le 1 \tag{17.16}$$

($[x]$ denotes the integer part of $x$). The random process $Z_n(t)$ contains the whole
information of the approximating sequence $X_1, X_2, \ldots, X_n$ up to time $n$. It may
be shown that

$$Z_n(t) \to \rho \int_{(0,t]} \Big(\frac{u}{t}\Big)^{\rho A - I} dW(u) \tag{17.17}$$

where $W$ is a Gaussian process with stationary independent increments in $\mathbb{R}^k$
possessing the covariance matrix $\mathrm{Cov}(W(1)) = C$ (see Walk [22]). Functional
limit theorems of type (17.17), sometimes called "invariance principles", help
very much to get a deeper insight into the pathwise behavior of the
approximating process $\{X_n\}$. In particular large deviation results or laws of
the iterated logarithm may be based on result (17.17).

Moreover a stopping time leading to asymptotic level $\alpha$ confidence regions
may be derived from the asymptotic distribution (17.17). If $A$ and $C$ are known
and the norm $\|\cdot\|_D$ is defined by $\|x\|_D = \sqrt{x'Dx}$ for a positive definite matrix
$D$ then

$$\tau_\varepsilon = \frac{\kappa_\alpha}{\varepsilon^2} \tag{17.18}$$

is a family of stopping times satisfying (17.8), i.e.

$$\lim_{\varepsilon \to 0} P\{\|X_{\tau_\varepsilon} - x^*\|_D \le \varepsilon\} \ge 1 - \alpha.$$

Here $\kappa_\alpha$ denotes the upper $\alpha$-quantile of the distribution of

$$R = Z'DZ \quad \text{where } Z \sim N(0, \Sigma). \tag{17.19}$$
Unfortunately this distribution is tedious to calculate. There is however a good
approximation by a $\Gamma$-distribution

$$R \sim \text{approximately } \Gamma\Big(\frac{\mathrm{tr}^2(D\Sigma)}{2\,\mathrm{tr}((D\Sigma)^2)},\ \frac{2\,\mathrm{tr}((D\Sigma)^2)}{\mathrm{tr}(D\Sigma)}\Big) \tag{17.20}$$

where $\Gamma(a, b)$ has the density

$$\frac{x^{a-1} \exp(-x/b)}{b^{a}\, \Gamma(a)}, \qquad x > 0.$$

Approximation (17.20) is based on the comparison of the first two cumulants of
the distributions (see Kendall and Stuart [10]). Notice that the $\Gamma$-distribution
degenerates to a $\chi^2(\mathrm{tr}(D\Sigma))$ distribution if $D\Sigma$ is idempotent.
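
The moment matching behind (17.20) can be sketched in a few lines of Python.
The helper names and the example matrices below are assumptions made for
illustration only. The function returns the shape and scale of the
$\Gamma$-law whose mean is $\mathrm{tr}(D\Sigma)$ and whose variance is
$2\,\mathrm{tr}((D\Sigma)^2)$, the first two cumulants of $R = Z'DZ$.

```python
def matmul(a, b):
    """Product of two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(m):
    return sum(m[i][i] for i in range(len(m)))


def gamma_approximation(d, sigma):
    """Return (shape, scale) of the Gamma law matching mean and variance of Z'DZ."""
    ds = matmul(d, sigma)
    t1 = trace(ds)                  # E(R) = tr(D Sigma)
    t2 = trace(matmul(ds, ds))      # Var(R) = 2 tr((D Sigma)^2)
    shape = t1 * t1 / (2.0 * t2)
    scale = 2.0 * t2 / t1
    return shape, scale


# Idempotent case: D = Sigma = I in R^3 gives the chi-square law with 3 degrees of freedom.
identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shape_chi2, scale_chi2 = gamma_approximation(identity3, identity3)

# A hypothetical non-idempotent example.
d_example = [[2.0, 0.0], [0.0, 1.0]]
sigma_example = [[1.0, 0.5], [0.5, 2.0]]
shape_ex, scale_ex = gamma_approximation(d_example, sigma_example)
```

For the identity example the result is shape 3/2 and scale 2, which is the
$\chi^2$ distribution with 3 degrees of freedom, in line with the remark on
idempotent $D\Sigma$.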

If $A$ and $C$ are not known, they have to be estimated during the procedure.
(A possible method of estimation is indicated in the last section.) Suppose that
$A_n$ resp. $C_n$ are consistent estimates of $A$ resp. $C$. Then $\Sigma_n$ given by

$$\Big(\rho A_n - \frac{I}{2}\Big)\Sigma_n + \Sigma_n\Big(\rho A_n' - \frac{I}{2}\Big) = \rho^2 C_n$$

consistently estimates $\Sigma$.
Let $\kappa_{\alpha,n}$ be the upper $\alpha$-quantile of the $\Gamma$-distribution (17.20) where $\Sigma$ is
replaced by $\Sigma_n$. Then we may define the
(a') deterministic stepsize stopping time (DST):

$$\tau_\varepsilon = \inf\Big\{ n \;\Big|\; \frac{\kappa_{\alpha,n}}{n} \le \varepsilon^2 \Big\} \tag{17.21}$$

By using the functional limit law (17.17) one may prove that $\tau_\varepsilon$ leads to an
asymptotically unbiased confidence region, hence satisfies (17.8).

A quite similar result holds if the point of solution $x^*$ lies on the boundary.
Denote by $K^*$ the tangent cone to $S$ at the point $x^*$ and suppose that $H$ is
the largest linear subspace contained in $K^*$. It may then be proved that the
limit law of $\sqrt{n}\,(X_n - x^*)$ is again a normal distribution but concentrated on
$H$ (see Pflug [16]). Thus the constrained situation may be reduced to the
unconstrained one by considering only the projection of the Hessian matrix $A$ and
the covariance matrix $C$ onto the subspace $H$. The situation is however different
if $\dim H = 0$. In that case $S$ is pointed at $x^*$ and the asymptotics are different.
This is however a rather unlikely case.

The big disadvantage of the rules (a) and (a') lies in the fact that the
pertaining confidence regions are not at all uniform in the sense of definition
(17.9). This is clear since everything was based on asymptotic formulas which
do not reflect the influence of the starting value $x_0$. This is in fact the most
important reason why these rules are much less competitive in practice.
The idea behind more elaborate stepsize rules is clear: Some information
concerning the progress of the procedure should be gathered during the
approximation process and should influence the actual stepsizes. Some possible
ways of doing so are listed below.


(b) The adaptive stepsize rule (ASR) 


This rule was formulated for the first time in Mirozahmedov and Uriasev [14].
Let $Y_n$ denote the n-th stochastic gradient, i.e. $Y_n = Y_{X_n}$. The rule is to adapt
$\rho_n$ according to the inner product $\langle Y_n, Y_{n-1} \rangle$, i.e.

$$\rho_{n+1} = \rho_n \exp[a \rho_n \langle Y_{n+1}, Y_n \rangle - \delta \rho_n] \tag{17.22}$$

where $a$ and $\delta$ are some fixed constants. The motivation for this choice comes
from deterministic optimization since there a rule of the form

increase $\rho_n$ if $\langle Y_{n+1}, Y_n \rangle > 0$

decrease $\rho_n$ if $\langle Y_{n+1}, Y_n \rangle < 0$

leads to an optimal speed of convergence.

The term $-\delta\rho_n$ in the exponent of (17.22) is added to guarantee the
convergence of $\rho_n$ to zero. Mirozahmedov and Uriasev show that the assumptions
of the convergence theorem part (ii) are fulfilled and hence

$$\bar X_n \to x^* \quad \text{a.s.}$$

The same rule but with $\delta = 0$ was studied by Ruszczynski and Syski [19]. Some
comments on this rule can be found in Section 17.3.
The stopping criterion pertaining to this rule is


(b') The adaptive stopping time (AST)


$$\tau = \inf\{ n \mid \rho_n < \rho^* \} \tag{17.23}$$


This time does not lead to a confidence region with fixed size. 
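
As an illustration of the update (17.22) and the stopping time (17.23), here
is a minimal Python sketch. The function names, the default constants and the
hand-made gradient sequence are assumptions chosen for the example, not values
from the text.

```python
import math


def asr_update(rho, y_next, y_prev, a=1.0, delta=0.1):
    """One step of rho_{n+1} = rho_n * exp(a * rho_n * <Y_{n+1}, Y_n> - delta * rho_n)."""
    inner = sum(u * v for u, v in zip(y_next, y_prev))
    return rho * math.exp(a * rho * inner - delta * rho)


def adaptive_stopping_time(rho0, gradients, rho_star, a=1.0, delta=0.1):
    """Return (n, rho_n) for the first n with rho_n < rho_star, or (None, last rho)."""
    rho = rho0
    for n in range(1, len(gradients)):
        rho = asr_update(rho, gradients[n], gradients[n - 1], a, delta)
        if rho < rho_star:
            return n, rho
    return None, rho


grown = asr_update(0.5, [1.0, 0.0], [1.0, 0.0])      # gradients agree: stepsize grows
shrunk = asr_update(0.5, [-1.0, 0.0], [1.0, 0.0])    # gradients oppose: stepsize shrinks
# Alternating gradients mimic pure oscillation around the solution.
stop_n, stop_rho = adaptive_stopping_time(
    0.5, [[1.0], [-1.0], [1.0], [-1.0]], rho_star=0.3)
```

With agreeing gradients the stepsize grows, with opposing gradients it
shrinks, and for the alternating sequence the threshold is crossed at the
first update.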


(c) The decrease of objective function rule (DOSR) 


This rule is based on a recursive estimate $f_n$ of the objective function $E_\xi(g(x, \xi))$,
namely

$$f_0 = g(X_0, \xi_0)$$
$$f_{n+1} = (1 - \beta_n) f_n + \beta_n\, g(X_{n+1}, \xi_{n+1}).$$

The constants $\beta_n$ determine the degree of smoothing, e.g. $\beta_n = \beta$ (exponential
smoothing) or $\beta_n = (n+1)^{-1}$ (arithmetic mean). The stepsize rule itself
employs

$$\rho_{n+1} = \begin{cases} \gamma_1 \rho_n & \text{if } \dfrac{f_{n-M_n} - f_n}{\sum_{i=n-M_n+1}^{n} \|X_i - X_{i-1}\|} \le \gamma_2 \\[2ex] \rho_n & \text{otherwise.} \end{cases}$$

Here $\gamma_1 < 1$ and $\gamma_2 > 0$ are fixed constants and $\{M_n\}$ is an (increasing)
sequence of nonnegative integers. Thus $\rho_n$ stays either constant or decreases
by a factor. The pertaining stopping time is


(c') The decrease of objective function stopping time (DOST).

$$\tau = \inf\{ n \mid \rho_n < \rho^* \}$$


Unfortunately there are no general properties known of this rule.
(d) The ratio of progress stepsize rule (RPSR) 


This rule is similar to the above but measures the progress in the argument.

$$\rho_{n+1} = \begin{cases} \gamma_1 \rho_n & \text{if } \dfrac{\|X_n - X_{n-M_n}\|}{\sum_{i=n-M_n+1}^{n} \|X_i - X_{i-1}\|} \le \gamma_2 \\[2ex] \rho_n & \text{otherwise.} \end{cases}$$

Again the stopping time is
(d') The ratio of progress stopping time (RPST)

$$\tau = \inf\{ n \mid \rho_n < \rho^* \}$$

Both preceding rules suffer from the defect that the last $M_n$ steps (with $M_n$
increasing) have to be kept in memory. They are described in Ermoliev and
Gaivoronski [5].
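
A small Python sketch of the ratio of progress test may help. It assumes the
reading that the stepsize is multiplied by $\gamma_1$ whenever the net
displacement over the stored steps is at most $\gamma_2$ times the length of
the path travelled. The function names and the constants are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def progress_ratio(path):
    """Net displacement divided by travelled path length over the stored points."""
    net = norm([a - b for a, b in zip(path[-1], path[0])])
    travelled = sum(
        norm([a - b for a, b in zip(path[i], path[i - 1])])
        for i in range(1, len(path))
    )
    return net / travelled if travelled > 0.0 else 0.0


def rpsr_update(rho, path, gamma1=0.5, gamma2=0.3):
    """Reduce rho by the factor gamma1 when the ratio of progress is at most gamma2."""
    return gamma1 * rho if progress_ratio(path) <= gamma2 else rho


straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
rho_straight = rpsr_update(1.0, straight)   # steady progress: stepsize kept
rho_zigzag = rpsr_update(1.0, zigzag)       # pure oscillation: stepsize reduced
```

The list `path` plays the role of the last $M_n + 1$ iterates, which is the
memory requirement mentioned above.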


(e) The oscillation test stepsize rule (OTSR) 


This rule keeps the stepsize constant as long as some statistical test indicates 
that the behavior of the path is pure oscillation and no progress in the objective 
function is made. Then the step size is decreased by some factor. 

Consider the procedure (17.4) with fixed stepsize

$$X_{n+1} = \pi_S(X_n - \rho\, Y_{X_n}) \tag{17.24}$$

This is by construction a time-homogeneous Markov process. Under some weak
regularity conditions this process is ergodic, i.e. it converges in law to the
unique stationary measure of (17.24). Let $X_n^\rho$ be a stationary sequence of this
Markovian process. It may be shown that if $x^*$ lies in the interior of $S$ then

$$\rho^{-1/2} (X_n^\rho - x^*) \to N(0, \Sigma) \tag{17.25}$$

as $\rho \to 0$, where $\Sigma$ is a solution of

$$A\Sigma + \Sigma A' = C \tag{17.26}$$

with $C$ resp. $A$ given by (17.14) resp. (17.15). The similarity of (17.26) to
(17.13) is interesting to notice. Again $\Sigma$ may be calculated from $A$ and $C$ as

$$\Sigma = \int_0^\infty \exp(-uA) \cdot C \cdot \exp(-uA')\, du$$

or

$$\mathrm{vec}\,\Sigma = (I \otimes A + A' \otimes I)^{-1}\, \mathrm{vec}\, C.$$

Thus for small but constant $\rho$ the process $X_n$ converges in law to a normal
distribution with mean $x^*$ and covariance matrix $\rho\Sigma$. This corresponds to
the well known fact that for fixed stepsize $\rho$ the process first approaches some
neighborhood of the solution and begins to oscillate around it afterwards. The
OTSR makes a decision for decreasing the stepsize by testing whether this
oscillatory behavior is already reached. As a test statistic we may use the
inner product of subsequent gradients $V_n = \langle Y_n, Y_{n-1} \rangle$. If the sequence $X_n$ is
stationary and has the limiting distribution (17.25) then

$$E(V_n) = \rho\, \mathrm{tr}(A'A(I - \rho A)\Sigma) - \rho\, \mathrm{tr}(AC).$$

If $A$ is symmetric and $\rho \ll 1$ this expression may be approximated by

$$E(V_n) = -\frac{\rho}{2}\, \mathrm{tr}(AC).$$

If $X_n$ is not yet oscillating, $E(\langle Y_n, Y_{n-1} \rangle)$ is typically much larger. The unknown
matrices $A$ and $C$ may be estimated consistently by $A_n$ resp. $C_n$. By equation
(17.26) this leads also to an estimate $\Sigma_n$ of $\Sigma$.
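
The negative expectation of the inner product in the oscillating phase can be
checked numerically in the scalar case. The sketch below is an illustration
under assumptions: a one-dimensional quadratic problem with $A = a$,
$C = c$, $x^* = 0$, Gaussian errors, and hypothetical parameter values. It
simulates the constant stepsize recursion and averages $Y_n Y_{n-1}$ after a
burn-in period.

```python
import random


def mean_inner_product(a, c, rho, n_steps, burn_in=1000, seed=1):
    """Average of Y_n * Y_{n-1} for X_{n+1} = X_n - rho * Y_n, Y_n = a * X_n + Z_n."""
    rng = random.Random(seed)
    noise_sd = c ** 0.5
    x = 0.0                     # start at the solution x* = 0
    y_prev = None
    total = 0.0
    count = 0
    for n in range(n_steps):
        y = a * x + rng.gauss(0.0, noise_sd)
        if y_prev is not None and n >= burn_in:
            total += y * y_prev
            count += 1
        y_prev = y
        x = x - rho * y
    return total / count


a, c, rho = 1.0, 1.0, 0.05
sigma = c / (2.0 * a)                                   # scalar solution of (17.26)
formula = rho * a * a * (1.0 - rho * a) * sigma - rho * a * c
approximation = -0.5 * rho * a * c
estimate = mean_inner_product(a, c, rho, n_steps=200000)
```

Here `formula` evaluates the expression for $E(V_n)$ with the scalar
$\Sigma = c/(2a)$, `approximation` is $-\frac{\rho}{2}\mathrm{tr}(AC)$, and
`estimate` is the simulated average; the three values should be close to each
other and negative.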

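The OTSR is defined by a sequence of stopping times $\{\nu_k\}$ which are
defined recursively by

$$\nu_0 = 0$$
$$\nu_{k+1} = \inf\Big\{ n \;\Big|\; \frac{1}{n - \nu_k} \sum_{i=\nu_k+1}^{n} \langle Y_i, Y_{i-1} \rangle \le \rho_n\, \mathrm{tr}(A_n' A_n (I - \rho_n A_n)\Sigma_n) - \rho_n\, \mathrm{tr}(A_n C_n) + a_n \Big\} \tag{17.27}$$

The stepsizes $\rho_n$ are defined as

$$\rho_n = \rho_0 \cdot \gamma_3^{\,j} \quad \text{for } \nu_j \le n < \nu_{j+1} \tag{17.28}$$

Thus $\rho_n$ is decreased by the factor $\gamma_3 < 1$ exactly at the times $\nu_j$.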


(e’) The oscillation test stopping time (OTST) 


A pertaining stopping time is also based on (17.25), employing the same ideas as
were used in (17.20). Thus let $\kappa_{\alpha,n}$ be the upper $\alpha$-quantile of the $\Gamma$-distribution
(17.20) with $\Sigma$ replaced by $\Sigma_n$; then the stopping times $\tau_\varepsilon$ are

$$\tau_\varepsilon = \inf\{ n \mid \rho_n \kappa_{\alpha,n} \le \varepsilon^2 \} \tag{17.29}$$

This family of stopping times leads to exact level $\alpha$ confidence regions. Moreover
under mild assumptions these regions are uniform in the starting value and thus
satisfy (17.9).

If the solution point $x^*$ lies on the boundary then the result should be
modified as in the DSR-case (a). Again the largest hyperplane $H$ contained
in $K^*$, the tangent cone to $S$ at $x^*$, carries the whole mass of the asymptotic
distribution. By projecting everything onto the space $H$ this case may be
reduced to the unconstrained case. Details of the algorithm are presented in
Section 17.4.


(f) The inner product stepsize rule (IPSR) 


It has been pointed out in section (e) that the expectation of the inner product
$V_n = \langle Y_n, Y_{n-1} \rangle$ of two subsequent gradients is negative if the process is
oscillating. This fact can be used for the definition of a very simple stepsize rule.
Instead of comparing $E(V_n)$ with the asymptotically correct, but complicated
expression given in formula (17.27), only the sign of $E(V_n)$ is considered.

More precisely the IPSR is defined by a sequence of stopping times $\{\nu_k\}$

$$\nu_{k+1} = \inf\Big\{ n \;\Big|\; \frac{1}{n - \nu_k} \sum_{i=\nu_k+1}^{n} \langle Y_i, Y_{i-1} \rangle \le 0 \Big\}$$

The stepsizes are, as in the OTSR, defined as

$$\rho_n = \rho_0\, \gamma_3^{\,j} \quad \text{for } \nu_j \le n < \nu_{j+1}$$

It is evident that the IPSR decreases the stepsizes at an earlier stage than
the OTSR. This is sometimes desirable since the more complicated estimations
which are needed for the oscillation test rule are only valid for small $\rho$. Thus a
good compromise is to begin with the simple inner-product rule which provides
a fast convergence to a neighborhood of the solution. If the stepsizes $\rho_n$ are
small enough then the rule should be switched to OTSR. By such a procedure
one avoids the very quick decrease of $\rho_n$ in a later stage of the approximation
process.
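
The inner product rule is simple enough to sketch directly. The Python
example below is an illustration under assumptions: a scalar quadratic
problem with solution $x^* = 0$, Gaussian gradient errors, a reduction factor
of 1/2 and hypothetical parameter values. The stepsize is reduced whenever the
running mean of $Y_i Y_{i-1}$ since the last reduction is not positive.

```python
import random


def ipsr_run(x0, rho0, n_steps, gamma=0.5, a=1.0, noise_sd=1.0, seed=2):
    """Scalar recursion X_{n+1} = X_n - rho_n * Y_n with the inner product stepsize rule."""
    rng = random.Random(seed)
    x, rho = x0, rho0
    y_prev = a * x + rng.gauss(0.0, noise_sd)     # first stochastic gradient
    x = x - rho * y_prev
    running_sum, count = 0.0, 0
    reduction_times = []
    for n in range(1, n_steps + 1):
        y = a * x + rng.gauss(0.0, noise_sd)
        running_sum += y * y_prev                 # inner product of subsequent gradients
        count += 1
        if running_sum / count <= 0.0:            # oscillation suspected
            rho *= gamma
            reduction_times.append(n)
            running_sum, count = 0.0, 0           # restart the average
        x = x - rho * y
        y_prev = y
    return x, rho, reduction_times


x_final, rho_final, times = ipsr_run(x0=10.0, rho0=0.5, n_steps=2000)
```

In runs of this kind the first reduction typically comes only after the
transient from the starting value has died out, and later reductions follow
quickly, which is the behavior the switch to the OTSR is meant to avoid.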


(g) A review of other stepsize rules and stopping times 


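There are many other rules known, some of which are restricted to the univariate
case. Kesten [11] proposed for instance to choose

$$\rho_1 = 1$$
$$\rho_{n+1} = \begin{cases} \rho_n & \text{if } \operatorname{sgn} Y_{n-1} = \operatorname{sgn} Y_n \\[1ex] \dfrac{1}{k+1} & \text{if } \operatorname{sgn} Y_{n-1} \ne \operatorname{sgn} Y_n \text{ and } \rho_n = \dfrac{1}{k} \end{cases}$$

and showed a.s. convergence of this procedure.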
Farrell [8] considers also the univariate case and defines a stopping time of
the following kind. Suppose it is known that the solution $x^*$ lies in some interval
$a^{(1)} \le x^* \le a^{(2)}$. Then one may start two independent procedures $X_n^{(1)}$ and
$X_n^{(2)}$ with initial points

$$X_0^{(1)} = a^{(1)}, \qquad X_0^{(2)} = a^{(2)}$$

and stop if for the first time

$$|X_n^{(1)} - X_n^{(2)}| \le \varepsilon.$$

This procedure leads to asymptotic confidence regions of variable size.

Fabian [6] accelerates the approximation procedure by doing a kind of line
search. He takes additional observations of the objective function at X,+ LV,
Antes a) ee . etc. and chooses 


] . : 3 
pn = > where j = max{ilg(Xn + —Y¥ns€n) < 9(Xns€n)} 


He shows convergence but was unable to prove that this procedure is better 
than the DSR. 


17.3 A comparison of different rules 
It is hardly possible to give general statements about the superiority of one
rule over another because a detailed analysis of the performance of many rules
has not been done. Therefore we restrict ourselves to comparing them only for a
very simple but basic stochastic approximation problem.

We assume that $S = \mathbb{R}^k$, i.e. an unconstrained problem, and $f(x) = \frac12 x'Ax$,
a quadratic form with positive definite matrix $A$. The stochastic gradients are

$$\nabla g(x, \xi) = Ax + Z$$

where the errors $Z \sim N(0, C)$. Thus the procedure is

$$X_{n+1} = X_n - \rho_n A X_n - \rho_n Z_n \tag{17.30}$$

with $\{Z_n\}$ being a sequence of i.i.d. $N(0, C)$ random variables. Clearly the
solution $x^*$ equals zero in this example.
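If $X_0 = x_0$ is the starting value then $X_n$ may alternatively be represented as

$$\begin{aligned}
X_n &= \prod_{i=0}^{n-1} (I - \rho_i A)\, x_0 - \sum_{i=0}^{n-1} \rho_i \prod_{j=i+1}^{n-1} (I - \rho_j A)\, Z_i \\
&= \prod_{i=0}^{n-1} (I - \rho_i A)\, x_0 - U_n \quad \text{(say)}
\end{aligned}$$

The first summand represents the influence of the starting value and $U_n$ is the
"error" term. Consider first the DSR-situation, i.e. $\rho_n = \rho/n$ (with the
iteration started at $n = 1$). Then

$$U_n = \sum_{i=1}^{n-1} \frac{\rho}{i} \prod_{j>i}^{n-1} \Big(I - \frac{\rho}{j} A\Big) Z_i$$

has a normal distribution with expectation zero and covariance matrix

$$\Sigma_n = \sum_{i=1}^{n-1} \Big(\frac{\rho}{i}\Big)^2 \prod_{j>i}^{n-1} \Big(I - \frac{\rho}{j} A\Big)\, C \prod_{j>i}^{n-1} \Big(I - \frac{\rho}{j} A'\Big)$$

and $n\Sigma_n$ converges, in accordance with (17.13), to a solution of

$$\Big(\rho A - \frac{I}{2}\Big)\Sigma + \Sigma\Big(\rho A' - \frac{I}{2}\Big) = \rho^2 C.$$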

On the other hand if $\rho_n$ is kept constant, $\rho_n = \rho$ (like in the DOSR, RPSR and
OTSR situation), then the error term

$$U_n = \rho \sum_{i=0}^{n-1} (I - \rho A)^{n-i-1} Z_i$$

converges in law to the autoregressive AR(1) process

$$\tilde U_n = \rho \sum_{i=0}^{\infty} (I - \rho A)^{i} Z_{n-i-1}$$

Thus the approximation process with constant stepsize may be represented as
a sum of a component converging to zero and a stationary process. This is in
accordance with (17.25).
The covariance matrix of $\tilde U_n$ satisfies

$$\mathrm{Cov}(\tilde U_n) = \rho^2 \sum_{i=0}^{\infty} (I - \rho A)^{i}\, C\, (I - \rho A')^{i} = \rho\Sigma + o(\rho)$$

as $\rho \to 0$, where $\Sigma$ is given by (17.26).
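
As a check of the last relation, consider the scalar case $A = a > 0$,
$C = c$ with $0 < \rho a < 2$ (a worked example added for illustration). The
series is geometric, so

$$\rho^2 \sum_{i=0}^{\infty} (1 - \rho a)^{2i}\, c = \frac{\rho^2 c}{1 - (1 - \rho a)^2} = \frac{\rho c}{a(2 - \rho a)} = \rho \cdot \frac{c}{2a} + O(\rho^2),$$

and $\Sigma = c/(2a)$ is indeed the solution of the scalar version
$2a\Sigma = c$ of (17.26).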
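In a similar manner the gradient process $Y_n = AX_n + Z_n$ may be rewritten
as

$$\begin{aligned}
Y_n &= A(I - \rho A)^n x_0 - \rho A \sum_{i=0}^{n-1} (I - \rho A)^{n-i-1} Z_i + Z_n \\
&= A(I - \rho A)^n x_0 - W_n \quad \text{(say)}
\end{aligned}$$

where $W_n$ converges in law to an autoregressive moving average process $\tilde W_n$,

$$\tilde W_n = \rho A \sum_{i=0}^{\infty} (I - \rho A)^{i} Z_{n-i-1} - Z_n$$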
Thus the expectation of $Y_n'Y_{n-1}$ is approximately

$$\begin{aligned}
E(Y_n'Y_{n-1}) &\approx x_0' A (I - \rho A)^n (I - \rho A')^n A' x_0 \\
&\quad + \mathrm{tr}\Big( C \Big( \rho^2 \sum_{i=0}^{\infty} (I - \rho A')^{i+1} A'A (I - \rho A)^{i} - \rho A \Big) \Big) \\
&\approx x_0' A (I - \rho A)^n (I - \rho A')^n A' x_0 + \rho\, \mathrm{tr}(A'A\Sigma) - \rho\, \mathrm{tr}(AC)
\end{aligned}$$

(see (17.27)). Thus testing $Y_n'Y_{n-1}$ with the oscillation test (OTSR) is
equivalent to testing whether the term $x_0' A (I - \rho A)^n (I - \rho A')^n A' x_0$ is already
negligible. Since this criterion takes the influence of the starting value into
account, the resulting procedure leads to uniform asymptotic confidence regions.
This is the main advantage of this method.
The DOSR compares progress in the objective function with progress in 

the argument. The objective function is estimated in our example by 

f= LY (XIAX, + ZIAZ 

In= 9, DI { it t i) 


Thus the expectation of this estimate is 
E(fa) & = (tr(2AE) + tr(AC)). 
On the other hand the expression || X; — X;-1||? has expectation 
E(|X: — X;-1)!?) = tr(eAB) + tx(C) 
so that for very small p 
E(\|X; — X;—1) = o(tr(C)'”). 
Hence for very small p 


ti _ _tr(AC) 
Soe Xl ~ ae? 


irrespectively of p. We see that the DOSR will take approximately the same 
number of steps between two consecutive stepsize reductions. 
As concerns the ASR, it was shown by Ruszczynski and Syski that for the
sequence given by (17.22)

$$n \rho_n \to 1/\delta \quad \text{a.s.}$$

Thus this rule leads back to the DSR case, at least in an asymptotic sense.
Let us turn now to the case $\delta = 0$. If $C = 0$, i.e. the error term $Z$ is zero,
then the rule (17.22) reduces to

$$\log \rho_{n+1} = \log \rho_n + a\rho_n (X_n' A^2 X_n - \rho_n X_n' A^3 X_n).$$

Since $X_n' A^2 X_n - \rho_n X_n' A^3 X_n > 0$ if $\rho_n < 1/\lambda_{\max}$, where $\lambda_{\max}$ denotes the maximal
eigenvalue of $A$, one can see that in this case $\rho_n$ does not converge to zero. This
results in an exponential speed of convergence of $X_n$. The weighted means $\bar X_n$
converge then with the rate $1/n$.


If the error terms are present, i.e. $C \ne 0$, it may be shown that





Pn 
— converges a.s. 
VEC) 
Thus the assumption $\sum \rho_n^2 < \infty$ is not fulfilled. Hence the procedure $X_n$ is not


convergent itself and the weighted means Xp, converge with a speed E(X,) = 
o( 26 n ). 


n 




17.4 The implementation of the oscillation test 


The oscillation test routines were implemented by the author at IIASA. Some 
details of the implemented algorithms are given here. 

Recall that the method consists in keeping the stepsize constant as long as 
the test rejects the hypothesis that the behavior of the path is already oscilla- 
tion. If this hypothesis is strongly rejected then even an increase of the stepsize 
is advisable. 

The method needs estimates for $A$, the Hessian of the objective function at
$x^*$, and $C$, the covariance matrix of the errors.

An estimate for $C$ is easily found. At each step we take two independent
observations of the stochastic gradient

$$Y_n^{(1)} = \nabla g(X_n, \xi_n^{(1)})$$
$$Y_n^{(2)} = \nabla g(X_n, \xi_n^{(2)})$$

and calculate

$$Y_n = \tfrac12 (Y_n^{(1)} + Y_n^{(2)})$$
$$Z_n = \tfrac12 (Y_n^{(1)} - Y_n^{(2)})$$

The variance of $Y_n$ is half of the variance of $Y_n^{(1)}$. This random variable is
taken for determining $X_{n+1}$. The error variable $Z_n$ is used for the estimation
of $C$. As we know from the general considerations, the asymptotic distribution
is concentrated on the largest linear subspace contained in the tangent cone of
$S$ at $x^*$. Let $K_n$ be the tangent cone of $S$ at $X_n$. (If $X_n$ is in the interior of $S$
then $K_n = \mathbb{R}^k$.) Let $H_n$ be the largest subspace contained in $K_n$ and let $\tilde Z_n$ be
the projection of $Z_n$ onto $H_n$. Then

$$C_n = \frac{1}{n} \sum_{i=1}^{n} \tilde Z_i \tilde Z_i' \tag{17.31}$$

is used as an estimate of $C$.
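
The two-observation device can be sketched for a scalar gradient as follows.
The fixed evaluation point, the noise level and the names are assumptions of
this example; in the actual procedure the point $X_n$ changes from step to
step and the projection onto $H_n$ is applied first.

```python
import random


def estimate_error_variance(n_steps, noise_sd=2.0, seed=3):
    """Average and half-difference of two independent gradient observations."""
    rng = random.Random(seed)
    true_gradient = 0.7          # hypothetical gradient value at the current point
    sum_zz = 0.0
    sum_y = 0.0
    for _ in range(n_steps):
        y1 = true_gradient + rng.gauss(0.0, noise_sd)
        y2 = true_gradient + rng.gauss(0.0, noise_sd)
        y = 0.5 * (y1 + y2)      # used for the iteration step
        z = 0.5 * (y1 - y2)      # pure error term, free of the gradient
        sum_zz += z * z
        sum_y += y
    return sum_zz / n_steps, sum_y / n_steps


c_hat, y_mean = estimate_error_variance(20000)
```

Since $Z_n$ has the same variance as the error of the averaged gradient $Y_n$
(here $2^2/2 = 2$), `c_hat` estimates the covariance that is relevant for the
process driven by $Y_n$.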

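Next define $\Delta X_n = X_n - X_{n-1}$, $\Delta Y_n = Y_n - Y_{n-1}$. As $E(\Delta Y_n) = A\,\Delta X_n +
O(\|X_n - x^*\|^2 + \|\Delta X_n\|^2)$ we may construct an estimate of the relevant part
of $A$ as follows: Project $\Delta Y_n$ and $\Delta X_n$ onto $H_n$ to give $\Delta\tilde Y_n$ and $\Delta\tilde X_n$. The
matrix $A_n$ should satisfy

$$\Delta\tilde Y_n \approx A_n\, \Delta\tilde X_n \tag{17.32}$$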


Thus we may adjust $A_n$ recursively to satisfy (17.32) by setting

$$A_{n+1} = A_n - \frac{1}{n} (A_n \Delta\tilde X_n - \Delta\tilde Y_n) \frac{\Delta\tilde X_n'}{\|\Delta\tilde X_n\|^2} \tag{17.33}$$


It remains to solve the equation

$$A_n \Sigma_n + \Sigma_n A_n' = C_n$$

for the determination of the covariance matrix of $X_n$ (see (17.26)). To use
the indicated explicit formulas is however too time consuming. We use instead
again a recursive way of solving (17.26), namely

$$\Sigma_{n+1} = \Sigma_n - \frac{1}{n} [A_n \Sigma_n + \Sigma_n A_n' - C_n]. \tag{17.34}$$
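
A minimal Python sketch of a recursion of type (17.34) is given below. To
keep it short it assumes that the estimates are already fixed, $A_n = A$ and
$C_n = C$, and it uses the damping factor $1/n$; the example matrices and the
helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def lyapunov_residual(a, sigma, c):
    """Return A*Sigma + Sigma*A' - C."""
    left = matmul(a, sigma)
    right = matmul(sigma, transpose(a))
    k = len(a)
    return [[left[i][j] + right[i][j] - c[i][j] for j in range(k)] for i in range(k)]


def solve_recursively(a, c, n_steps):
    """Iterate Sigma_{n+1} = Sigma_n - (1/n) * (A Sigma_n + Sigma_n A' - C) from Sigma_1 = 0."""
    k = len(a)
    sigma = [[0.0] * k for _ in range(k)]
    for n in range(1, n_steps + 1):
        r = lyapunov_residual(a, sigma, c)
        sigma = [[sigma[i][j] - r[i][j] / n for j in range(k)] for i in range(k)]
    return sigma


a_fixed = [[2.0, 0.5], [0.5, 1.0]]      # positive definite
c_fixed = [[1.0, 0.0], [0.0, 1.0]]
sigma_hat = solve_recursively(a_fixed, c_fixed, 5000)
residual = lyapunov_residual(a_fixed, sigma_hat, c_fixed)
max_residual = max(abs(v) for row in residual for v in row)
```

Each step costs only a few matrix products, which is the reason for
preferring the recursion to the explicit formulas.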


It may be shown that if $A_n \to A$ (positive definite) and $C_n \to C$ (positive
definite) then $\Sigma_n$ converges to a solution of (17.26). The oscillation test compares

$$\frac{1}{n - \nu_k} \sum_{i=\nu_k+1}^{n} \langle Y_i, Y_{i+1} \rangle \quad \text{with} \quad \rho_n \big( \mathrm{tr}(A_n' A_n \Sigma_n) - \mathrm{tr}(A_n C_n) \big).$$

If the difference is smaller than $\gamma_1$ the hypothesis of oscillation is accepted and
the stepsize is decreased (by a factor $\gamma_2$, usually $\gamma_2 = \frac12$).

If however the inner products are much larger than their asymptotic
expectations, the stepsize is increased (by a factor $\gamma_3 > 1$).

The asymptotic confidence region can be found by looking at the distribution
of the quadratic form

$$(X_n - x^*)' D (X_n - x^*)$$

which is approximately

$$\Gamma\Big( \frac{\mathrm{tr}^2(D\Sigma_n)}{2\,\mathrm{tr}((D\Sigma_n)^2)},\ \frac{2\rho\,\mathrm{tr}((D\Sigma_n)^2)}{\mathrm{tr}(D\Sigma_n)} \Big)$$

(see (17.20)). If the upper $\alpha$ percentile of this distribution is smaller than $\varepsilon^2$
then the whole procedure can be stopped and we know that

$$P\{\|X_n - x^*\|_D \le \varepsilon\} \approx 1 - \alpha$$

if $\varepsilon$ is small enough.
Sometimes it is not required to know $1 - \alpha$ confidence regions but the
knowledge of the expectation of $(X_n - x^*)' D (X_n - x^*)$ suffices. Since

$$E((X_n - x^*)' D (X_n - x^*)) \approx \rho\, \mathrm{tr}(D\Sigma_n)$$

this value can easily be calculated. If this value is smaller than some
predetermined constant the whole procedure stops. Due to the careful testing and
estimation the final value obtained is a very reliable one.
CHAPTER 18
ADAPTIVE STOCHASTIC QUASIGRADIENT PROCEDURES* 


S. Urasiev 


18.1 Introduction 


In this chapter we deal with iterative algorithms for solving the stochastic
optimization problem

$$\min E f(x, \omega) \tag{18.1}$$

subject to the constraints

$$x \in X \subset \mathbb{R}^n$$

where $x$ are variables to be chosen which take values in Euclidean space and $\omega$
are random parameters which belong to some probability space. Our main
concern is the improvement of performance features of the stochastic quasigradient
(SQG) method

$$x^{s+1} = \pi_X(x^s - \rho_s \xi^s) \tag{18.2}$$

where $\pi_X$ is the projection operator on the set $X$, $x^s$ is the current approximation
to the solution, $\rho_s$ is the stepsize and $\xi^s$ is the step direction which, roughly
speaking, on average points in the direction of the gradient of the function
$E f(x, \omega)$. The reader can find a survey of such methods and further references in
Chapter 6 (see also [1]). One of the main challenges facing an implementor of SQG
methods is the appropriate selection of the stepsize $\rho_s$. Theory gives only very
general guidelines:

$$\rho_s \to 0, \qquad \sum_{s=0}^{\infty} \rho_s = \infty, \qquad \sum_{s=0}^{\infty} \rho_s^2 < \infty.$$
s=—0 s—0 


In earlier papers on stochastic approximation [2], the stepsize was chosen
in advance to satisfy these conditions, for instance $\rho_s = c/s$. In what
follows, such choices which depend only on the iteration number will be called
programmed or off-line rules. Unfortunately they lead to very slow convergence,
although they assure in some sense an optimal asymptotic rate. However, in
practical computations SQG methods are used to reach a reasonable neighborhood
of the solution, not the exact value of the solution. For such purposes, asymptotic
results are not relevant, and neither are programmed rules for choosing the
stepsize. In this chapter, adaptive or on-line rules for computing $\rho_s$ are studied
which exhibit much

* This chapter is based on the report presented at the International 
Conference on Stochastic Optimization, Kiev, 1984. 
more satisfactory behavior. Such methods utilize information gathered during
the optimization process to make a decision about the current value of the
stepsize $\rho_s$. More specifically, $\rho_s$ may depend on observations of the random
function $f(x^k, \omega^k)$ or the stochastic quasigradient $\xi^k$ in some or all preceding
iterations $k \le s$. Some on-line rules were proposed in [3]-[7]. This chapter is
based on [5]-[7] and describes one particular adaptive SQG method in which the
stepsize increases or decreases depending on whether subsequent quasigradients
point in the same or in opposite directions.

This chapter consists of 5 sections. In Section 18.2 the adaptive SQG 
method is described, its convergence is analyzed in Section 18.3. Implementa- 
tion details are discussed in Section 18.4, and the chapter ends in Section 18.5 
with a description of some particular problems solved by algorithm together 
with results of numerical experiments. 


18.2 Algorithm Description 
In what follows we shall consider an algorithm of type (18.2) for problem (18.1). It
will be assumed that the process takes place in a probability space $(\Omega, \mathcal{A}, P)$ where
$\mathcal{A}$ is a $\sigma$-field and $P$ is a probability measure. The vector $\xi^s$ from (18.2) is a
stochastic quasigradient, i.e.

$$E(\xi^s \mid \mathcal{B}_s) = F_x(x^s) + b^s$$

where $F_x(x^s)$ is the gradient of the function $F(x) = E_\omega f(x, \omega)$, conditions on $b^s$ will
be imposed later and $\mathcal{B}_s$ is the $\sigma$-field defined by the process history, i.e., the random
variables $\{x^0, \ldots, x^s\}$. We shall keep in mind that $x^s$ depends on random
parameters from $\Omega$, but will not specify this dependence explicitly.

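We shall first explain the idea of adaptive stepsize control informally.
Here, for simplicity, we shall assume that the function $F(x)$ is smooth and $X = \mathbb{R}^n$.
It is quite natural to choose the step $\rho_s$ to minimize $F(x)$ along the direction $\xi^s$,
i.e., such that the function $\varphi_s(\rho)$ has its minimum over $\rho$, where

$$\varphi_s(\rho) = E[F(x^s - \rho \xi^s) \mid x^s].$$

This is an analogue of stepsize rules used extensively in deterministic optimization.
It is easy to see that

$$\begin{aligned}
\frac{d\varphi_s(\rho)}{d\rho}\Big|_{\rho=\rho_s} &= E\Big[\frac{d}{d\rho} F(x^s - \rho \xi^s)\Big|_{\rho=\rho_s} \,\Big|\, x^s\Big] \\
&= -E[\langle \nabla F(x^s - \rho_s \xi^s), \xi^s \rangle \mid x^s] \\
&= -E[\langle \nabla F(x^{s+1}), \xi^s \rangle \mid x^s] \\
&= -E[\langle \xi^{s+1}, \xi^s \rangle \mid x^s], \qquad s = 0, 1, \ldots
\end{aligned}$$

Thus $-\langle \xi^{s+1}, \xi^s \rangle$ is a stochastic quasigradient of the function $\varphi_s(\rho)$ at the point $\rho_s$
on iteration $s + 1$. To modify the step $\rho_s$, let us use the following gradient procedure:

$$\rho_{s+1} = \rho_s + \lambda_s \langle \xi^{s+1}, \xi^s \rangle, \qquad \lambda_s > 0, \ s = 0, 1, \ldots \tag{18.3}$$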
The value of $\langle \xi^{s+1}, \xi^s \rangle$ gives some information on whether the current value $\rho_s$
exceeds the minimizer of the function $\varphi_s(\rho)$ over $\rho$ on iteration $s$. If
$\langle \xi^{s+1}, \xi^s \rangle > 0$ then it is probable that the minimizer of $\varphi_s(\rho)$ is greater than $\rho_s$
and it is necessary to increase the stepsize; when the sign is negative the stepsize
should be decreased. This information is used to modify the step $\rho_s$.

Naturally, a decision based on these arguments will be subject to error due to
stochastic phenomena. However, these errors will be smoothed out in the course
of the iterations. It is convenient to rewrite relation (18.3) in the following form

$$\rho_{s+1} = \rho_s\, a_s^{\langle \xi^{s+1}, \xi^s \rangle}, \qquad a_s > 0, \ s = 0, 1, \ldots \tag{18.4}$$

(18.3) is a special case of (18.4) since for each $\lambda_s$ such that $\rho_{s+1} > 0$, $a_s$
can be selected so that the values $\rho_{s+1}$ computed by formulas (18.3) and (18.4)
coincide. In order to guarantee fulfillment of the convergence condition for SQG
algorithms $\sum_{s=0}^{\infty} \rho_s = \infty$ (see Chapter 6), the value $a_s$ is calculated by the formula

$$a_s = a^{\rho_s}, \qquad a > 1, \ s = 0, 1, \ldots$$

Convergence of the algorithm (18.2) with the stepsize rule (18.4) can be
established [7] in the deterministic case, when $\xi^s = F_x(x^s)$ and $F(x)$ is a strongly convex
function. For the stochastic case, let us modify formula (18.4) as follows

$$\rho_{s+1} = \rho_s\, a_s^{\langle \xi^{s+1}, \xi^s \rangle - \delta} = \rho_s\, a^{\rho_s \langle \xi^{s+1}, \xi^s \rangle - \delta \rho_s}, \qquad \delta > 0, \ s = 0, 1, \ldots \tag{18.5}$$

Introduction of the term $\delta\rho_s$ guarantees fulfillment of one more convergence
condition

$$\rho_s \to 0 \ \text{a.s.}, \quad s \to \infty.$$
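
To make the interplay of (18.2) and (18.5) concrete, here is a minimal Python
sketch on a toy problem. Everything specific in it is an assumption made for
illustration: the objective $F(x) = \frac12\|x\|^2$ on a box, the bounded
uniform noise, the parameter values, and in particular the upper bound
`rho_max` on the stepsize, which is a safeguard of this sketch and not part of
formula (18.5). The weighted average of the iterates with weights $\rho_s$ is
accumulated along the way.

```python
import random


def project_box(x, lo, hi):
    """Projection onto the box [lo, hi]^n, playing the role of pi_X."""
    return [min(max(xi, lo), hi) for xi in x]


def objective(x):
    return 0.5 * sum(xi * xi for xi in x)


def adaptive_sqg(x0, rho0, n_steps, a=2.0, delta=1.0, rho_max=1.0, seed=4):
    """SQG iteration with rho_{s+1} = rho_s * a**(rho_s * <xi_{s+1}, xi_s> - delta * rho_s)."""
    rng = random.Random(seed)
    x, rho = list(x0), rho0
    # Quasigradient of F(x) = 0.5 * ||x||^2: gradient x plus bounded noise.
    xi_prev = [xj + rng.uniform(-0.3, 0.3) for xj in x]
    weighted_sum = [rho * xj for xj in x]
    weight_total = rho
    for _ in range(n_steps):
        x = project_box([xj - rho * gj for xj, gj in zip(x, xi_prev)], -2.0, 2.0)
        xi = [xj + rng.uniform(-0.3, 0.3) for xj in x]
        inner = sum(u * v for u, v in zip(xi, xi_prev))
        rho = min(rho_max, rho * a ** (rho * inner - delta * rho))
        weighted_sum = [w + rho * xj for w, xj in zip(weighted_sum, x)]
        weight_total += rho
        xi_prev = xi
    x_bar = [w / weight_total for w in weighted_sum]
    return x, x_bar, rho


x_init = [1.0, -1.0]
x_end, x_avg, rho_end = adaptive_sqg(x_init, rho0=0.1, n_steps=5000)
```

While successive quasigradients point in the same direction strongly enough,
the stepsize grows; once the iterates reach the noise-dominated region the
term $-\delta\rho_s$ takes over and the stepsize decays roughly like $1/s$.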


18.3 Convergence Analysis


Besides convergence of the sequence $x^s$ to the solution of problem (18.1), we are
also interested in convergence of some convex combinations of this sequence.
With the sequence $x^s$ generated by algorithm (18.2), (18.5), we associate
the sequence

$$\bar x^s = \sum_{l=0}^{s} \rho_l x^l \Big/ \sum_{l=0}^{s} \rho_l \tag{18.6}$$

and the convergence of $\bar x^s$ to the solution will be studied. If such convergence
does occur, the initial sequence $x^s$ will be called Cesaro convergent. The
advantages of dealing with such convergence are the following:


- the sequence $\bar x^s$ displays much more regular behavior than the original sequence
$x^s$;

- $\bar x^s$ can be computed with almost no additional effort in an iterative way using
the sequence $x^s$;

- some convergence conditions can be relaxed for Cesaro convergence, and in
some cases $x^s$ does not converge to the solution in the ordinary sense, but is
Cesaro convergent.
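
The remark that the averaged sequence can be computed iteratively can be
illustrated by a short Python sketch. The function name and the sample data
are hypothetical; the update uses only the previous average, the accumulated
weight and the new pair of point and stepsize.

```python
def update_weighted_mean(x_bar, weight_total, x_new, rho_new):
    """Fold the new point x_new with weight rho_new into the running weighted mean."""
    weight_total += rho_new
    step = rho_new / weight_total
    x_bar = [m + step * (xn - m) for m, xn in zip(x_bar, x_new)]
    return x_bar, weight_total


points = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]   # sample iterates
steps = [0.5, 0.25, 0.25]                        # sample stepsizes used as weights

x_bar, total = list(points[0]), steps[0]
for x_new, rho_new in zip(points[1:], steps[1:]):
    x_bar, total = update_weighted_mean(x_bar, total, x_new, rho_new)

# Direct evaluation of the weighted mean for comparison.
direct = [sum(r * p[j] for r, p in zip(steps, points)) / sum(steps) for j in range(2)]
```

Both ways of computing the weighted mean give the same point, so no history
of iterates has to be stored.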


This type of convergence was used in [8],[9]. The following theorem from 
[7] gives conditions for Cesaro convergence of the method (18.2). We shall use 
the abbreviation a.s. for the words “almost sure”. 


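Theorem 1. Let $F(x)$ be a convex function defined on a convex closed bounded set $X \subset R^n$,

$$\max_{x, y \in X} \|x - y\| = C_1, \qquad (18.7)$$
$$E\|\xi^s - F_x(x^s) - b^s\|^2 \le C_2^2, \qquad (18.8)$$
$$\overline{\lim}_{s \to \infty} \|b^s\| \le b, \qquad (18.9)$$
$$\rho_s > 0 \ \text{a.s.}, \quad s = 0, 1, \ldots, \qquad (18.10)$$
$$E\rho_s^2 < \infty, \quad s = 0, 1, \ldots, \qquad (18.11)$$
$$\rho_s \to 0 \ \text{a.s. as } s \to \infty, \qquad (18.12)$$
$$\sum_{s=0}^{\infty} \rho_s = \infty \ \text{a.s.}, \qquad (18.13)$$

and at least one of the two following conditions is satisfied:

(1) Step $\rho_s$ depends only on $(x^0, \ldots, x^s, \xi^0, \ldots, \xi^{s-1})$ (it is measurable with respect to the $\sigma$-algebra $B_s$ induced by $(x^0, \ldots, x^s, \xi^0, \ldots, \xi^{s-1})$);

(2) $\rho_s / \rho_{s-1} \to 1$ a.s., and $\rho_s$ depends only on $(x^0, \ldots, x^s, \xi^0, \ldots, \xi^s)$ (it is measurable with respect to the $\sigma$-algebra induced by $(x^0, \ldots, x^s, \xi^0, \ldots, \xi^s)$).

Then

$$\overline{\lim}_{s \to \infty} F(\bar{x}^s) - F(x^*) \le b\,C_1 \ \text{a.s.},$$

where

$$x^* \in X^* = \{x^* : F(x^*) = \min_{x \in X} F(x)\}$$

and $\bar{x}^s$ is defined by (18.6).


Corollary. If $b^s \to 0$ a.s. then all accumulation points of the sequence $\bar{x}^s$ are solutions of problem (18.1).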


The main difference between the conditions for Cesaro convergence and for convergence in the usual sense for SQG methods is that the condition $\sum_{s=0}^{\infty} \rho_s^2 < \infty$, which is needed for normal convergence a.s., is not needed for Cesaro convergence. This makes verifying convergence conditions for adaptive SQG methods much easier.

Now we are prepared to give convergence results for the adaptive SQG method (18.2), (18.5).




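Theorem 2. Let $F(x)$ be a convex (possibly nonsmooth) function defined on some vicinity of a convex compact subset $X \subset R^n$. If the following conditions are satisfied:

$$\max_{x, y \in X} \|x - y\| = C_1, \qquad (18.14)$$
$$\sup_s \|\xi^s\| \le C_2 \ \text{a.s.}, \qquad (18.15)$$
$$\overline{\lim}_{s \to \infty} \|b^s\| \le b, \qquad (18.16)$$
$$\delta > C_2\, \overline{\lim}_{s \to \infty} \inf_{h \in \partial F(x^{s+1})} \|\xi^{s+1} - h\| \ \text{a.s.}, \qquad (18.17)$$

where $\partial F(x)$ is the set of subgradients of $F(x)$ at point $x$, then

$$\overline{\lim}_{s \to \infty} \left( F(\bar{x}^s) - \min_{x \in X} F(x) \right) \le b\,C_1 \ \text{a.s.},$$

i.e. if $\lim_{s \to \infty} b^s = 0$ then

$$F(\bar{x}^s) - \min_{x \in X} F(x) \to 0 \ \text{a.s.}$$

and all accumulation points of the sequence $\bar{x}^s$ are solutions of problem (18.1) a.s.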


Proof. Condition (18.10) of Theorem 1 follows directly from (18.5) since $\rho_0 > 0$ and $a > 0$. Here we shall give only an outline of the proof, which consists of checking the conditions of Theorem 1. We shall check here conditions (18.12)-(18.13) of Theorem 1 and assume $b^s = 0$ (for more details see [5]-[7]).


1. Let us show that condition (18.13) of Theorem 1 is satisfied, i.e. $\sum_{s=0}^{\infty} \rho_s = \infty$ a.s. Assume the opposite, i.e. there exists a constant $K$ such that the probability of the event

$$A = \Big\{ \omega : \sum_{s=0}^{\infty} \rho_s \le K \Big\}$$

is positive. From the projection properties and (18.15), we get the estimate

$$\|x^{s+1} - x^s\| \le \|\rho_s \xi^s\| \le \rho_s C_2 \ \text{a.s.} \qquad (18.18)$$

Stepsize rule (18.5) together with (18.18) yields:

$$\rho_{s+1} = \rho_s\, a^{\langle \xi^{s+1}, x^s - x^{s+1} \rangle - \delta \rho_s} \ge \rho_s\, a^{-\|\xi^{s+1}\| \|x^s - x^{s+1}\| - \delta \rho_s} \ge \rho_s\, a^{-(C_2^2 + \delta) \rho_s} = \rho_s\, a^{-C_3 \rho_s},$$

where $C_3 = C_2^2 + \delta$. Therefore for $\omega \in A$ the following relation holds:

$$\rho_{s+1} \ge \rho_s\, a^{-C_3 \rho_s} \ge \rho_0\, a^{-C_3 \sum_{k=0}^{s} \rho_k} \ge \rho_0\, a^{-C_3 K},$$

which implies $\sum_{s=0}^{\infty} \rho_s = \infty$ for $\omega \in A$, contradicting the initial assumption.


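Therefore condition (18.13) is satisfied.

2. Consider now condition (18.12) and let us prove that $\rho_s \to 0$ as $s \to \infty$ a.s. Denoting

$$C_s = \inf_{h \in \partial F(x^{s+1})} \|\xi^{s+1} - h\|,$$

we obtain the following estimate, valid for any $F_x(x^{s+1}) \in \partial F(x^{s+1})$:

$$\langle \xi^{s+1}, x^s - x^{s+1} \rangle - \delta \rho_s = \langle F_x(x^{s+1}), x^s - x^{s+1} \rangle + \langle \xi^{s+1} - F_x(x^{s+1}), x^s - x^{s+1} \rangle - \delta \rho_s$$
$$\le F(x^s) - F(x^{s+1}) + \|\xi^{s+1} - F_x(x^{s+1})\| \, \|x^s - x^{s+1}\| - \delta \rho_s$$
$$\le F(x^s) - F(x^{s+1}) + C_2 \rho_s \|\xi^{s+1} - F_x(x^{s+1})\| - \delta \rho_s \ \text{a.s.}$$

Since $F_x(x^{s+1})$ in the last relation is an arbitrary vector belonging to the set $\partial F(x^{s+1})$, we obtain

$$\langle \xi^{s+1}, x^s - x^{s+1} \rangle - \delta \rho_s \le F(x^s) - F(x^{s+1}) + C_2 \rho_s \inf_{h \in \partial F(x^{s+1})} \|\xi^{s+1} - h\| - \delta \rho_s = F(x^s) - F(x^{s+1}) + (C_2 C_s - \delta) \rho_s \ \text{a.s.}$$

By substituting this estimate into (18.5), we obtain

$$\rho_{s+1} \le \rho_s\, a^{F(x^s) - F(x^{s+1}) + (C_2 C_s - \delta) \rho_s} \le \rho_{s-1}\, a^{F(x^{s-1}) - F(x^{s+1}) + \sum_{k=s-1}^{s} (C_2 C_k - \delta) \rho_k} \le \ldots \le \rho_0\, a^{F(x^0) - F(x^{s+1}) + \sum_{k=0}^{s} (C_2 C_k - \delta) \rho_k}.$$

Taking into consideration $\sum_{s=0}^{\infty} \rho_s = \infty$ a.s. and relations (18.14), (18.17), we see that the expression in the exponent in the last relation tends to $-\infty$:

$$\lim_{s \to \infty} \Big[ F(x^0) - F(x^{s+1}) + \sum_{k=0}^{s} (C_2 C_k - \delta) \rho_k \Big] = -\infty \ \text{a.s.}$$

Since $a > 1$, this implies $\rho_s \to 0$ a.s.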
Now we have to show that condition (2) of Theorem 1 is satisfied. The following relation holds:

$$\frac{\rho_{s+1}}{\rho_s} = a^{\langle \xi^{s+1}, x^s - x^{s+1} \rangle - \delta \rho_s}.$$

Since $\rho_s \to 0$ a.s., estimates (18.15) and (18.18) give

$$\langle \xi^{s+1}, x^s - x^{s+1} \rangle - \delta \rho_s \to 0 \ \text{a.s.} \quad \text{and} \quad \frac{\rho_{s+1}}{\rho_s} \to 1 \ \text{a.s.}$$

After all conditions of Theorem 1 are tested, the statement of this theorem follows from it.


18.4 Implementation Strategies

In this section we discuss problems which arise during the implementation of the stochastic quasigradient algorithm (18.2), (18.5). Its implementation includes some heuristic elements. The implemented method can be used to find quickly a good initial approximation in the vicinity of the solution. The implemented algorithm described below performed substantially better than the method with a programmed rule for step size selection. First, we shall present the algorithm and then discuss some of its features.

Algorithm. Set $s = 0$ at the beginning of the computation.

Step 1. Computation of the stochastic quasigradient $\xi^s$.

Step 2. Averaging of the stochastic quasigradient norm $\|\xi^s\|$:

$$G_s = G_{s-1} + (\|\xi^s\| - G_{s-1}) \cdot D.$$

At the beginning of the computation $G_{-1} = 0$.

Step 3. The computation of the average current point drift:

$$Q_s = G_s \rho_s.$$

Step 4. Check the stopping criterion: if $Q_s < Q_\varepsilon$ or $s > S$ (the iteration limit), finish the computation, otherwise go to the next step.

Step 5. The computation of the scalar product $T_s$:

$$T_s = \langle \xi^s, x^{s-1} - x^s \rangle.$$

Step 6. Averaging of the absolute value of $T_s$:

$$Z_s = Z_{s-1} + (|T_s| - Z_{s-1}) \cdot D.$$

At the beginning of the computation $Z_{-1} = 0$.

Step 7. Rule for the selection of the step size $\rho_s$:

$$\rho_s = \rho_{s-1} R^{T_s / Z_s} \times \begin{cases} 1 & \text{if } T_s \ge 0, \\ U & \text{if } T_s < 0. \end{cases}$$

Step 8. Limiting the step size change:

$$\rho_s = \begin{cases} 3 \rho_{s-1} & \text{if } \rho_s \rho_{s-1}^{-1} > 3, \\ 4^{-1} \rho_{s-1} & \text{if } \rho_s \rho_{s-1}^{-1} < 4^{-1}, \\ \rho_s & \text{otherwise.} \end{cases}$$

Step 9. Finding the next approximation:

$$z^{s+1} = x^s - \rho_s \xi^s.$$

Step 10. Projection on the feasible region $X$:

$$x^{s+1} = \pi_X(z^{s+1}).$$

Step 11. Take $s = s + 1$ and go to Step 1.


Two stopping criteria are implemented in the method. The first one is by the number of iterations. The second stopping criterion is by the value of the mean point drift, which is equal to the product of the mean norm of the quasigradient $\xi^s$ and the step size $\rho_s$. When the value of the drift becomes less than the threshold value $Q_\varepsilon$, the method stops (Steps 3, 4). The step size control differs from the theoretical one (18.5) in several aspects (Step 7). For one thing, the value $T_s$ is divided by the averaged absolute value of $T_s$. In addition, a reduction of the step size by means of a factor $U$, $0 < U \le 1$, is introduced. The additional reduction takes place only if

$$T_s = \langle \xi^s, x^{s-1} - x^s \rangle < 0.$$

Since $T_s / Z_s$ is a random value, the step size $\rho_s$ can increase or decrease, sometimes by too large a factor (Step 7). So that the next step does not differ too strongly from the preceding one $\rho_{s-1}$, bounding coefficients are provided for the increase or decrease of the step size (Step 8).

Recommendations on the choice of the algorithm parameters. The following recommendations are obtained as a result of numerical experiments.

- The value of the mean change of step size $R$ ($1 < R < 3$) is usually set to $R = 2$.

- The value of the initial step size has no essential effect on the convergence rate of the method. However, if additional information is available, the initial value of the step size $\rho_0$ can be set approximately to

$$\rho_0 \approx \|x^* - x^0\| \, (E\|\xi^0\|)^{-1},$$

where $x^0$ is the initial approximation and $x^*$ is the estimated location of the extremum point.

- Parameter $k$ defines the averaging factor $D = 1/k$ in the averaging formulas (Steps 2 and 6). Usually $k$ is selected within the range $4 \le k \le 6$.

- Parameter $U$ (additional coefficient of step size reduction) is selected within the range $0.8 \le U \le 1$. With $k > 1$ coefficient $U$ can be equal to 1, since the step size decreases fast without additional decrease.

- The value of the mean shift $Q_\varepsilon$ in the stopping criterion is to be set approximately to the required solution accuracy for the components of $x$.
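Steps 1-11 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function name `adaptive_sqg`, the treatment of the first iteration (where $x^{s-1}$ does not exist, so $T_0$ is taken as 0), the guard against a zero average $Z_s$, and the use of the most recent step size in the drift test are all choices made here. The test problem (a noisy gradient of $\frac{1}{2}\|x - c\|^2$ on a box) is hypothetical.

```python
import math
import random


def adaptive_sqg(quasigradient, project, x0, rho0=1.0, R=2.0, k=5, U=0.9,
                 q_stop=1e-3, max_iter=1000):
    """Sketch of Steps 1-11; returns the last point and the step sizes used."""
    D = 1.0 / k                 # averaging factor D = 1/k
    x, x_prev = list(x0), None  # x^{s-1} is undefined at s = 0
    rho_prev = rho0
    G = Z = 0.0                 # G_{-1} = Z_{-1} = 0
    steps = []
    for _ in range(max_iter):                                   # iteration limit
        xi = quasigradient(x)                                   # Step 1
        G += (math.sqrt(sum(g * g for g in xi)) - G) * D        # Step 2
        if G * rho_prev < q_stop:                               # Steps 3-4 (assumption:
            break                                               # latest known step size)
        if x_prev is None:
            T = 0.0                                             # assumption for s = 0
        else:
            T = sum(g * (p - c) for g, p, c in zip(xi, x_prev, x))  # Step 5
        Z += (abs(T) - Z) * D                                   # Step 6
        exponent = T / Z if Z > 0.0 else 0.0                    # guard: Z may be 0
        rho = rho_prev * R ** exponent * (1.0 if T >= 0.0 else U)  # Step 7
        rho = min(max(rho, rho_prev / 4.0), 3.0 * rho_prev)     # Step 8
        z = [c - rho * g for c, g in zip(x, xi)]                # Step 9
        x_prev, x = x, project(z)                               # Step 10
        steps.append(rho)
        rho_prev = rho                                          # Step 11
    return x, steps


# Hypothetical test problem: noisy gradient of 0.5*||x - c||^2 on the box [0, 10]^2.
random.seed(0)
target = [3.0, 4.0]


def noisy_gradient(x):
    return [xi - ti + random.gauss(0.0, 0.5) for xi, ti in zip(x, target)]


def box(z):
    return [min(max(v, 0.0), 10.0) for v in z]


x_final, step_sizes = adaptive_sqg(noisy_gradient, box, [9.0, 1.0], max_iter=200)
```

Because $Z_s \ge D\,|T_s|$, the exponent $T_s / Z_s$ never exceeds $k$ in absolute value, so the power in Step 7 stays bounded even before Step 8 clips it.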




18.5 Results of Numerical Experiments 

Let us note first that it is advisable to average the values of the variables and of the objective function over a fixed number of the last iterations and take these quantities as the final approximation to the solution. The averaged value of the coordinates $x^s$ will now be designated as $\bar{x}$ and the averaged value of $f(x^s, \omega^s)$ as $\bar{f}$.

Problem 1. The following problem is an example of a multi-commodity facility location problem [7]. It is necessary to minimize

$$F(x) = E \sum_{i=1}^{5} \max\{a_i (x_i - \theta_i);\ b_i (\theta_i - x_i)\}$$

under constraints

$$x_1 + x_2 + 2x_3 + 3x_4 + x_5 = 200,$$
$$x_1 \le 50, \quad x_2 \le 7, \quad x_3 \le 7, \quad x_4 \le 80, \quad x_5 \le 26,$$
$$x_i \ge 0, \quad i = 1, \ldots, 5.$$

Here $\theta_i$ are random values uniformly distributed on the intervals $[A_i, B_i]$, $i = 1, \ldots, 5$. Vectors $a = (a_1, \ldots, a_5)$, $b = (b_1, \ldots, b_5)$, $A = (A_1, \ldots, A_5)$, $B = (B_1, \ldots, B_5)$ are defined as follows:

$$A = (0, 0, 0, 0, 0); \quad B = (60, 15, 17, 90, 40);$$
$$a = (1, 0, 3, 1, 2); \quad b = (3, 4, 1, 2, 3).$$

This problem allows an analytical solution, which makes it possible to compare the solution obtained by the algorithm with the exact one. The analytical form of the objective function is the following:

$$f(x) = \tfrac{1}{30} x_1^2 + \tfrac{2}{15} x_2^2 + \tfrac{2}{17} x_3^2 + \tfrac{1}{60} x_4^2 + \tfrac{1}{16} x_5^2 - 3x_1 - 4x_2 - x_3 - 2x_4 - 3x_5 + 278.5.$$


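The stochastic quasigradient is computed by the formula

$$\xi^s = (\xi_1^s, \ldots, \xi_5^s), \qquad \xi_i^s = \begin{cases} a_i, & \text{if } x_i^s \ge \theta_i^s, \\ -b_i, & \text{if } x_i^s < \theta_i^s, \end{cases} \quad i = 1, \ldots, 5.$$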
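The closed form of the objective can be checked against sampling. The sketch below uses the data $a$, $b$, $B$ of Problem 1 (with $A = 0$); integrating $\max\{a_i(x_i - \theta_i), b_i(\theta_i - x_i)\}$ over a uniform $\theta_i$ on $[0, B_i]$ gives $\frac{a_i + b_i}{2B_i} x_i^2 - b_i x_i + \frac{b_i B_i}{2}$ for $0 \le x_i \le B_i$. The function names, the test point and the sample size are illustrative assumptions.

```python
import random

a = [1.0, 0.0, 3.0, 1.0, 2.0]
b = [3.0, 4.0, 1.0, 2.0, 3.0]
B = [60.0, 15.0, 17.0, 90.0, 40.0]   # theta_i is uniform on [0, B_i]


def f_exact(x):
    """Closed-form expectation, valid for 0 <= x_i <= B_i."""
    return sum((ai + bi) / (2.0 * Bi) * xi * xi - bi * xi + bi * Bi / 2.0
               for ai, bi, Bi, xi in zip(a, b, B, x))


def sample_theta(rng):
    return [rng.uniform(0.0, Bi) for Bi in B]


def sampled_loss(x, theta):
    return sum(max(ai * (xi - ti), bi * (ti - xi))
               for ai, bi, xi, ti in zip(a, b, x, theta))


def quasigradient(x, theta):
    """Component i is a_i if x_i >= theta_i and -b_i otherwise."""
    return [ai if xi >= ti else -bi for ai, bi, xi, ti in zip(a, b, x, theta)]


rng = random.Random(1)
x_test = [40.0, 7.0, 2.5, 41.0, 22.0]      # hypothetical test point
n_samples = 100000
monte_carlo = sum(sampled_loss(x_test, sample_theta(rng))
                  for _ in range(n_samples)) / n_samples
value_at_origin = f_exact([0.0] * 5)
example_gradient = quasigradient(x_test, [50.0, 1.0, 1.0, 1.0, 1.0])
```

The constant term is $\sum_i b_i B_i / 2 = 278.5$, which is the value at the origin, and the quadratic coefficients reproduce $1/30, 2/15, 2/17, 1/60, 1/16$.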


The following exact solution was obtained using quadratic programming methods:

$$x^* = (41.88057;\ 7.00000;\ 2.48092;\ 41.27456;\ 22.33456), \qquad f(x^*) = 98.100089.$$

Algorithm parameters are

$$R = 1.5; \quad k = 4; \quad U = 0.9; \quad \rho_0 = 1.0.$$

Initial point is

$$x^0 = (0, 0, 0, 0, 0); \quad f(x^0) = 278.5.$$

Step size on the 91st iteration: $\rho_{91} = 0.1532$.
The results for averaged values of the coordinates and of the function for the 91st to 100th iteration are as follows:

$$\bar{x}_i = \frac{1}{10} \sum_{s=91}^{100} x_i^s, \quad i = 1, \ldots, 5:$$
$$\bar{x}_1 = 40.5485; \quad \bar{x}_2 = 6.9981; \quad \bar{x}_3 = 2.4381; \quad \bar{x}_4 = 42.2561; \quad \bar{x}_5 = 20.3561;$$
$$\bar{f}(x) = \frac{1}{10} \sum_{s=91}^{100} f(x^s, \theta^s) = 97.4185.$$


For comparison, we give below the results of solving the same problem using the method with programmed control of step size. The initial approximation was the same. In the asymptotically optimal [11] off-line step size rule $\rho_s = 1/(\ell(s + a))$, parameter $\ell$ must be equal to the least eigenvalue of the Hessian of the objective function, i.e., $\ell = 1/30$. In this case we selected $a = 10$ and got approximately the same performance. However, our choice was based on exact information about the objective function. If such information is not available, the off-line decision rule performs much worse.

Problem 3. A random locational equilibrium problem (Weber problem) [12]. The classical statement of the Weber problem is as follows: given $n$ points $w_i$, $i = 1, \ldots, n$, in two-dimensional Euclidean space $R^2$, find a point $x \in R^2$ which minimizes the sum of distances $\|w_i - x\|$. In the generalized statement of the problem [12] each point $w_i$, $i = 1, \ldots, n$, is considered to be a random variable represented by some probability measure $\theta_i(w)$ over $R^2$. The problem now is to find the location of a point $x \in R^2$ which minimizes the weighted sum of expectations of distances between point $x$ and points $w_i$, $i = 1, \ldots, n$, i.e.

$$F(x) = \sum_{i=1}^{n} \beta_i \int_{R^2} \|w - x\| \, \theta_i(dw),$$

where $\beta_i > 0$, $i = 1, \ldots, n$. The stochastic quasigradient at point $x^s$ can be chosen as follows:

$$\xi^s = \sum_{i=1}^{n} \beta_i \eta_i^s,$$

where

$$\eta_i^s = \begin{cases} \dfrac{x^s - w_i^s}{\|x^s - w_i^s\|} & \text{if } x^s \ne w_i^s, \\ 0 & \text{otherwise,} \end{cases}$$

and $w_i^s$ is distributed according to $\theta_i(w)$.

In this particular example, the number of destination points was chosen as $n = 30$, and $\theta_i$, $i = 1, \ldots, n$, were taken as bivariate normal density functions whose means and standard deviations were generated randomly in the range 0-20. The weights $\beta_i$ were also generated randomly in the range 0-10.
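One draw of this quasigradient can be sketched as below. The text gives only means and standard deviations for each coordinate, so the sketch assumes the two coordinates of each $w_i$ are independent normal variables; the function name `weber_quasigradient` and the small data sets are illustrative.

```python
import math
import random


def weber_quasigradient(x, means, devs, weights, rng):
    """Sum of beta_i * (x - w_i)/||x - w_i|| for one random draw of every w_i.

    Assumption: the coordinates of w_i are independent normals with the given
    means and standard deviations.
    """
    g = [0.0, 0.0]
    for (m1, m2), (d1, d2), beta in zip(means, devs, weights):
        w1, w2 = rng.gauss(m1, d1), rng.gauss(m2, d2)
        dx, dy = x[0] - w1, x[1] - w2
        dist = math.hypot(dx, dy)
        if dist > 0.0:              # the term is taken as 0 when x coincides with w_i
            g[0] += beta * dx / dist
            g[1] += beta * dy / dist
    return g


rng = random.Random(3)
# Three illustrative destination points with hypothetical spreads.
means = [(3.0, 7.5), (6.0, 6.5), (10.0, 15.5)]
devs = [(18.5, 4.0), (19.0, 16.0), (0.5, 8.5)]
weights = [8.5, 9.5, 6.0]
g_random = weber_quasigradient([41.0, 87.0], means, devs, weights, rng)

# Degenerate check: zero spread, all points at the origin, x on the positive axis.
g_fixed = weber_quasigradient([1.0, 0.0], [(0.0, 0.0)] * 3, [(0.0, 0.0)] * 3,
                              weights, rng)
```

Every term is a unit vector scaled by its weight, so the norm of the quasigradient can never exceed $\sum_i \beta_i$, which gives a simple bound of the kind required by condition (18.15).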

The exact value of the extremum is $x^* = (8.36; 9.36)$, and the initial approximation is $x^0 = (41, 87)$. The results for averaged values of the variables $x^s$ for the 50-th to 60-th iteration are as follows:

$$\bar{x} = \frac{1}{10} \sum_{s=51}^{60} x^s = (9.1, 10.2),$$

and for the 190-th to 200-th iteration they are

$$\bar{x} = \frac{1}{10} \sum_{s=191}^{200} x^s = (8.9, 9.0).$$


If the initial approximation is $x^0 = (54, 30)$, the results for averaged values of the variables $x^s$ for the 20-th to 30-th iteration are as follows:

$$\bar{x} = \frac{1}{10} \sum_{s=21}^{30} x^s = (8.0, 10.1),$$

and for the 190-th to 200-th iteration they are

$$\bar{x} = \frac{1}{10} \sum_{s=191}^{200} x^s = (7.9, 9.7).$$

The following table contains detailed description of the problem. 


Means, coordinate $x_1$ (points 1-30):
   3.02   6.07   9.77  16.26   6.12  14.80   7.24   7.52  15.91  13.57
   2.08  12.70   0.16  15.78   3.95  11.89   4.68   6.11   9.19  11.56
  12.43  19.98  15.33  18.20   7.84   1.16   4.54  17.48  10.78   1.45

Means, coordinate $x_2$ (points 1-30):
   7.63   6.62  15.40  10.83   4.85  17.14   2.20   9.30  17.30  14.60
   5.68   4.77  19.10  17.17   0.80  10.82  11.48  18.99   0.36   2.52
  10.00   1.93  11.39  16.41  16.21   2.09  16.69   8.70  12.04   2.93

Standard deviations, coordinate $x_1$ (points 1-30):
  18.65  18.95   0.45  13.50  17.55   1.12  18.42   1.59  15.65   9.49
  19.13  18.19  19.56  19.14  11.938  7.26   1.72  11.37   7.09  16.05
  15.62   4.31  15.44   1.40   5.82   8.56  16.72   5.29  10.36  12.49

Standard deviations, coordinate $x_2$ (points 1-30):
   3.77  15.79   8.68   6.29   7.97   9.23   5.81   3.17  17.91   7.02
  16.27  15.08   5.12   6.11   1.55  19.25   8.24  17.78  13.48   9.80
   5.49  15.13   7.07  16.83  15.86   9.90  19.44  16.35   0.37  15.31

Weights (points 1-30):
   8.50   9.48   6.03   8.16   9.05   1.80   8.17   7.57   3.43   9.62
   2.87   3.77   4.34   4.88   0.11   2.13   7.75   1.64   5.75   6.12
   4.57   4.45   2.95   0.17   7.53   9.39   7.38   1.15   2.09   7.20
CHAPTER 19


A NOTE ABOUT PROJECTIONS IN THE 
IMPLEMENTATION OF STOCHASTIC 
QUASIGRADIENT METHODS 


R.T. Rockafellar and R.J-B Wets* 


Given a stochastic optimization problem: find $x \in X \subset R^n$ that minimizes $F(x) = E\{f(x, \xi)\}$, where $f : R^n \times \Xi \to R$ is a real-valued function, the quasi-gradient algorithm generates a sequence $\{x^1, x^2, \ldots\}$ of points of $X$ (converging to the optimal solution with probability 1) through the recursion:

$$x^{\nu+1} = \mathrm{prj}_X (x^\nu - \rho_\nu \zeta^\nu),$$

where $\mathrm{prj}_X$ denotes the projection on $X$, $\{\rho_\nu, \nu = 1, \ldots\}$ is a sequence of positive scalars that tend to 0, and $\zeta^\nu$ is a stochastic quasi-gradient of $F$ at $x^\nu$; see Chapter 5.

Unless $X$ is a simple convex set, e.g. a rectangle or a ball, the projection operation may be too onerous to allow for a straightforward implementation of the iterative step; one would have to find at each step

$$x^{\nu+1} = \mathrm{argmin} [\mathrm{dist}^2 (x^\nu - \rho_\nu \zeta^\nu, x) \mid x \in X],$$


which means solving a mathematical program with quadratic objective func- 
tion. Therefore the implementations of the stochastic quasi-gradient method 
rely usually on various schemes to bypass this projection operation, through 
penalization or primal-dual methods, for example. There are however a few 
cases when it is possible to design a very effective subroutine to perform the 
projection operation. 

We describe a simple method for projecting a point $\hat{y} \in R^n$ on a convex set $X$, assumed to be nonempty, that is the intersection of a rectangle $C \subset R^n$ and a set determined by a single linear or, more generally, by a separable nonlinear constraint of the type:

$$\sum_{j=1}^{n} a_j(x_j) \le b, \qquad (19.1)$$

where the $a_j$ are convex differentiable functions such that for every $j = 1, \ldots, n$ the derivative $a_j'$ of $a_j(\cdot)$ is positive and bounded away from zero on $C$, where

$$C = \{x \in R^n \mid \ell_j \le x_j \le u_j,\ j = 1, \ldots, n\} \qquad (19.2)$$

with $\ell_j = -\infty$ and $u_j = +\infty$ if $x_j$ is not bounded below or above. We had to deal with such a case in connection with the model described in Chapter 22. (For related work, cf. [2]-[6].) Since the derivative of a convex function is a monotone nondecreasing function, the preceding condition on the derivative is satisfied if (and only if)

$$a_j'(\ell_j) > 0 \quad \text{if } \ell_j \text{ is finite} \qquad (19.3)$$

or, if $\ell_j = -\infty$,

$$\lim_{\tau \to -\infty} a_j'(\tau) =: a_j'(\ell_j) > 0.$$

Set $a_j'(u_j) = \lim_{\tau \to \infty} a_j'(\tau)$ if $u_j = +\infty$. In the special case when $a_j(\cdot)$ is linear, in which case we write

$$a_j(x_j) = a_j x_j, \qquad (19.4)$$

this condition boils down to having $a_j > 0$.

* Supported in part by grants of the National Science Foundation.
The projection $\mathrm{prj}_X \hat{y}$ of $\hat{y}$ on $X$ is the optimal solution of the (convex) nonlinear program

$$\text{find } x \in C \subset R^n \text{ such that } \sum_{j=1}^{n} a_j(x_j) \le b \text{ and } z = \tfrac{1}{2} \mathrm{dist}^2(\hat{y}, x) \text{ is minimized.} \qquad (19.5)$$

Here "dist" is the Euclidean distance, i.e. the objective is the quadratic form

$$\mathrm{dist}^2(\hat{y}, x) = \sum_{j=1}^{n} x_j^2 - 2 \sum_{j=1}^{n} \hat{y}_j x_j + \sum_{j=1}^{n} \hat{y}_j^2. \qquad (19.6)$$

Since the feasible region

$$X = C \cap \Big\{ x \,\Big|\, \sum_{j=1}^{n} a_j(x_j) \le b \Big\} \qquad (19.7)$$

is a closed convex set, and the objective is an inf-compact (closed and bounded level sets) strictly convex function, the projection problem (19.5) is always solvable and it has a unique solution, which is $\mathrm{prj}_X \hat{y}$.

Of course, it would be very easy to find the optimal solution of such a problem if there were no additional constraints besides $x \in C$. Our purpose is to show that with a single additional constraint it is possible to devise an algorithmic procedure for solving (19.5) that requires only marginally more work. This is achieved by constructing a (partial) dual to (19.5) whose solution gives us the (optimal) Lagrange multiplier $\lambda^*$ to associate to the constraint $\sum_j a_j(x_j) \le b$. When this multiplier $\lambda^*$ is known, the theory of convex optimization allows us to replace (19.5) by the following separable convex optimization problem:

$$\text{find } x \in C \subset R^n \text{ such that } \sum_{j=1}^{n} \left[ \tfrac{1}{2}(x_j - \hat{y}_j)^2 + \lambda^* a_j(x_j) \right] \text{ is minimized.} \qquad (19.8)$$

The solution to such a problem yields $x^* = \mathrm{prj}_X \hat{y}$, with

$$x_j^* = \begin{cases} \ell_j & \text{if } (\ell_j - \hat{y}_j) + \lambda^* a_j'(\ell_j) \ge 0, \\ u_j & \text{if } (u_j - \hat{y}_j) + \lambda^* a_j'(u_j) \le 0, \\ x_j \text{ where } x_j + \lambda^* a_j'(x_j) = \hat{y}_j, & \text{otherwise.} \end{cases} \qquad (19.9)$$

In particular if $a_j(\cdot)$ is linear (19.4), then (19.9) becomes

$$x_j^* = \begin{cases} \ell_j & \text{if } (\ell_j - \hat{y}_j) + \lambda^* a_j \ge 0, \\ u_j & \text{if } (u_j - \hat{y}_j) + \lambda^* a_j \le 0, \\ \hat{y}_j - \lambda^* a_j & \text{otherwise.} \end{cases} \qquad (19.10)$$


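Thus all that is needed is an efficient procedure for finding $\lambda^*$. To do so let us consider the following convex optimization problem:

$$\text{find } \lambda \in R_+ \text{ such that } g(\lambda) \text{ is maximized,} \qquad (19.11)$$

where

$$g(\lambda) = \min_{x \in C} \sum_{j=1}^{n} \left[ \tfrac{1}{2}(x_j - \hat{y}_j)^2 + \lambda a_j(x_j) \right] - \lambda b. \qquad (19.12)$$

In fact this problem is dual to our original problem (19.5). This claim can be substantiated by appealing to the general duality theory for convex optimization problems, cf. [7]; the Lagrangian generating (19.5) and (19.11) as a dual pair of problems is the function:

$$L(x, \lambda) = \begin{cases} \sum_{j=1}^{n} \left[ \tfrac{1}{2}(x_j - \hat{y}_j)^2 + \lambda a_j(x_j) \right] - \lambda b & \text{if } x \in C,\ \lambda \ge 0, \\ +\infty & \text{if } x \notin C,\ \lambda \ge 0, \\ -\infty & \text{if } \lambda < 0. \end{cases}$$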


We can also argue directly as follows: define

$$\varphi(\eta) = \sup [\eta \lambda + g(\lambda) \mid \lambda \in R_+].$$

Note that $\varphi(0)$ is then the optimal value of (19.11). From (19.12) it follows that

$$\varphi(\eta) = \min_{x \in C} \Big[ \tfrac{1}{2} \mathrm{dist}^2(x, \hat{y}) \,\Big|\, \sum_{j=1}^{n} a_j(x_j) \le b - \eta \Big]$$

and in particular for $\eta = 0$, since $X = C \cap \{x \mid \sum_{j=1}^{n} a_j(x_j) \le b\}$ is nonempty, we obtain

$$\varphi(0) = \min_{x \in C} \Big[ \tfrac{1}{2} \mathrm{dist}^2(x, \hat{y}) \,\Big|\, \sum_{j=1}^{n} a_j(x_j) \le b \Big],$$

which is the optimal value of the projection problem (19.5). The equality of the optimal values implies in turn that if $x^0$ solves (19.5) and $\lambda^0$ solves (19.11) then, from definition (19.12), we have

$$\lambda^0 \Big( \sum_{j=1}^{n} a_j(x_j^0) - b \Big) = 0. \qquad (19.13)$$


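Thus the multiplier $\lambda^*$ that we seek, to substitute in (19.9), is the optimal solution of (19.11), a 1-dimensional optimization problem (on $R_+$). For any $\lambda \in R_+$, we can find an explicit expression that yields the argmin of (19.11), similar to (19.9), namely

$$x_j(\lambda) = \begin{cases} \ell_j & \text{if } \lambda \ge \eta_j^+ = (\hat{y}_j - \ell_j) / a_j'(\ell_j), \\ u_j & \text{if } \lambda \le \eta_j^- = (\hat{y}_j - u_j) / a_j'(u_j), \\ x_j & \text{if } \eta_j^- < \lambda < \eta_j^+, \end{cases} \qquad (19.14)$$

where $x_j + \lambda a_j'(x_j) = \hat{y}_j$.

Note that we have used the facts that $a_j'$ is nonnegative and nondecreasing, so that $a_j'(\ell_j) \le a_j'(u_j)$ and hence $\eta_j^- \le \eta_j^+$ for all $j$. With

$$J^-(\lambda) = \{j \mid \lambda \le \eta_j^-\}, \qquad J^+(\lambda) = \{j \mid \lambda \ge \eta_j^+\}, \qquad (19.15)$$

and

$$J(\lambda) = \{j \mid \eta_j^- < \lambda < \eta_j^+\},$$

we have that

$$g(\lambda) = \sum_{j \in J^-(\lambda)} \left[ \tfrac{1}{2}(u_j - \hat{y}_j)^2 + \lambda a_j(u_j) \right] + \sum_{j \in J^+(\lambda)} \left[ \tfrac{1}{2}(\ell_j - \hat{y}_j)^2 + \lambda a_j(\ell_j) \right] + \sum_{j \in J(\lambda)} \left[ \tfrac{1}{2}(x_j(\lambda) - \hat{y}_j)^2 + \lambda a_j(x_j(\lambda)) \right] - \lambda b. \qquad (19.16)$$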

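The function $g$ is concave: expression (19.12) gives us $g$ as the sum of a linear function $(-b)\lambda$ and a min-function (of a collection of linear functions in $\lambda$). Thus the derivative, if it exists, is a monotone nonincreasing function of $\lambda$. Finding the maximum of $g$ on $R_+$ corresponds to finding $\lambda^*$ such that $g'(\lambda^*) = 0$, unless $g'(0) \le 0$, in which case $\lambda^* = 0$. Here, unless $a_j'$ is pathological, we have that

$$g'(\lambda) = \sum_{j \in J^-(\lambda)} a_j(u_j) + \sum_{j \in J^+(\lambda)} a_j(\ell_j) - b + \sum_{j \in J(\lambda)} \left[ (x_j(\lambda) - \hat{y}_j) x_j'(\lambda) + a_j(x_j(\lambda)) + \lambda a_j'(x_j(\lambda)) x_j'(\lambda) \right]$$

and using the definition of $x_j(\lambda)$ when $j \in J(\lambda)$, namely $(x_j(\lambda) - \hat{y}_j) + \lambda a_j'(x_j(\lambda)) = 0$, this simplifies to

$$g'(\lambda) = \sum_{j \in J^-(\lambda)} a_j(u_j) + \sum_{j \in J^+(\lambda)} a_j(\ell_j) + \sum_{j \in J(\lambda)} a_j(x_j(\lambda)) - b. \qquad (19.17)$$

In the linear case, this becomes

$$g'(\lambda) = \sum_{j \in J^-(\lambda)} a_j u_j + \sum_{j \in J^+(\lambda)} a_j \ell_j + \sum_{j \in J(\lambda)} \left[ a_j \hat{y}_j - \lambda a_j^2 \right] - b. \qquad (19.18)$$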


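To find $\lambda^* \in \mathrm{argmax} [g(\lambda) \mid \lambda \in R_+]$, we propose the following procedure:

Step 0. Order $\{\eta_j^-, \eta_j^+,\ j = 1, \ldots, n\}$, say as $(\theta_1, \ldots, \theta_{2n})$, recording for each $\theta_i$ the corresponding label $(j, -)$ or $(j, +)$. (Ties correspond to an entry in the $\theta$-vector repeated the appropriate number of times.)

Set $\theta^- = 0$, $\theta^+ = \theta_p$, with $p = \min \{i \mid \theta_i > 0\}$.

Construct $J^-(\theta^- = 0)$, $J^+(0)$, $J(0)$.

Compute

$$g'(0) = \sum_{j \in J^-(0)} a_j(u_j) + \sum_{j \in J^+(0)} a_j(\ell_j) + \sum_{j \in J(0)} a_j(\hat{y}_j) - b.$$

If $g'(0) \le 0$, stop. Set $\lambda^* = 0$ and exit.
If $g'(0) > 0$, continue.

Step 1. Compute $g'(\theta^+)$ using (19.17) or (19.18).
If $g'(\theta^+) \le 0$, then find $\lambda^* \in [\theta^-, \theta^+]$ such that $g'(\lambda^*) = 0$, exit.
If $g'(\theta^+) > 0$, continue.

Step 2. Set $p := p + 1$, $\theta^- := \theta^+$, $\theta^+ := \theta_p$.
Adjust $J^-(\theta^-)$, $J^+(\theta^-)$, $J(\theta^-)$.
Return to Step 1.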


The algorithm clearly converges since it is a systematic search of a monotone nonincreasing function that eventually must reach the interval $[\theta_p, \theta_{p+1}]$ in which $g'$ takes on the value 0; the problem is known to have a solution, see the preceding comments about duality.

In the linear case, all operations prescribed by the algorithm are simple and straightforward. The derivative $g'(\lambda)$ is given by (19.18). In Step 1, when $g'(\theta^+) \le 0$, $\lambda^*$ is given by the expression

$$\lambda^* = \beta / \gamma,$$

where

$$\beta = \sum_{j \in J^-} a_j u_j + \sum_{j \in J^+} a_j \ell_j + \sum_{j \in J} a_j \hat{y}_j - b,$$

and

$$\gamma = \sum_{j \in J} a_j^2.$$
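For the linear case the whole procedure fits in a short routine. The following Python sketch walks through the positive breakpoints and solves $g'(\lambda) = 0$ on the bracketing interval with $\lambda^* = \beta / \gamma$. It assumes finite bounds and $a_j > 0$, and it recomputes the index sets on each interval instead of adjusting them incrementally, which is simpler but slower than Step 2 as written. The function name `project_box_halfspace` and the test data are illustrative.

```python
def project_box_halfspace(y, lower, upper, a, b):
    """Project y onto {lower <= x <= upper, sum_j a_j x_j <= b}, all a_j > 0.

    Returns the projection and the multiplier lambda*. Finite bounds assumed.
    """
    def x_of(lam):                    # formula (19.14) in the linear case
        return [min(max(yj - lam * aj, lj), uj)
                for yj, aj, lj, uj in zip(y, a, lower, upper)]

    def dg(lam):                      # derivative g'(lam), formula (19.18)
        return sum(aj * xj for aj, xj in zip(a, x_of(lam))) - b

    if dg(0.0) <= 0.0:                # Step 0: constraint inactive
        return x_of(0.0), 0.0

    # Positive breakpoints eta_j^- and eta_j^+, sorted and de-duplicated.
    thetas = sorted({t for yj, aj, lj, uj in zip(y, a, lower, upper)
                     for t in ((yj - uj) / aj, (yj - lj) / aj) if t > 0.0})
    theta_minus = 0.0
    for theta_plus in thetas:         # Steps 1 and 2
        if dg(theta_plus) <= 0.0:
            mid = 0.5 * (theta_minus + theta_plus)
            beta, gamma = -b, 0.0
            for yj, aj, lj, uj in zip(y, a, lower, upper):
                free = yj - mid * aj
                if free <= lj:        # j in J^+: fixed at the lower bound
                    beta += aj * lj
                elif free >= uj:      # j in J^-: fixed at the upper bound
                    beta += aj * uj
                else:                 # j in J: strictly between the bounds
                    beta += aj * yj
                    gamma += aj * aj
            lam = beta / gamma
            return x_of(lam), lam
        theta_minus = theta_plus
    raise ValueError("the feasible set is empty")


x_a, lam_a = project_box_halfspace([3.0, 3.0], [0.0, 0.0], [5.0, 5.0], [1.0, 1.0], 2.0)
x_b, lam_b = project_box_halfspace([10.0, 1.0], [0.0, 0.0], [5.0, 5.0], [1.0, 1.0], 4.0)
x_c, lam_c = project_box_halfspace([1.0, 1.0], [0.0, 0.0], [5.0, 5.0], [1.0, 1.0], 10.0)
```

On the bracketing interval $g'$ falls from a positive to a nonpositive value, so at least one index is free and $\gamma > 0$. In the second call the first coordinate is free and the second sits at its lower bound, giving $\lambda^* = 6$ and the projection $(4, 0)$.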


When the $a_j(\cdot)$ are nonlinear, the evaluation of $g'(\lambda)$ requires first the evaluation of $x_j(\lambda)$ for all $j \in J(\lambda)$. Also in Step 1 there may be difficulties in finding $\lambda^*$ when $g'(\theta^+) \le 0$. To begin with, let us consider the equations

$$x_j + \lambda a_j'(x_j) = \hat{y}_j. \qquad (19.19)$$

There are many situations when it is easy to find a closed form expression for $x_j$ as a function of $\lambda$. For example, if $a_j(x) = \alpha x^2 + \beta x + \gamma$ with $\alpha \ge 0$ (recall that $a_j(\cdot)$ is convex), then

$$x_j(\lambda) = (\hat{y}_j - \lambda \beta) / (1 + 2 \lambda \alpha).$$

In general, however, even when an explicit expression for the derivative is available, we may have to resort to a numerical procedure for finding $x_j(\lambda)$. But here we are greatly aided by the following observation. For $\lambda \in [\eta_j^-, \eta_j^+]$ the function

$$x \mapsto (x + \lambda a_j'(x) - \hat{y}_j)$$

is monotone nondecreasing between $\ell_j$ and $u_j$ with

$$(\ell_j - \hat{y}_j) + \lambda a_j'(\ell_j) \le 0$$

and

$$(u_j - \hat{y}_j) + \lambda a_j'(u_j) \ge 0,$$

as follows from the definition of $\eta_j^-$ and $\eta_j^+$, see (19.14). Thus a secant method [1], which we used in our implementation, is a very efficient procedure to find $x_j(\lambda)$.
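As an illustration of solving (19.19) for one coordinate, the sketch below applies a plain secant iteration started from the two bounds. The authors' own secant implementation is not given in the text, so the function name `solve_coordinate`, the tolerance, and the iteration cap are assumptions. The quadratic case is used as a check because its closed form is known; an exponential constraint function is added as a hypothetical nonlinear example.

```python
import math


def solve_coordinate(y_hat, lam, a_prime, lo, hi, tol=1e-12, max_iter=100):
    """Solve x + lam * a_prime(x) = y_hat by secant steps started at lo and hi."""
    def h(x):
        return x + lam * a_prime(x) - y_hat

    x0, x1 = lo, hi
    h0, h1 = h(x0), h(x1)
    for _ in range(max_iter):
        if abs(h1) <= tol or h1 == h0:   # converged, or secant slope undefined
            break
        x2 = x1 - h1 * (x1 - x0) / (h1 - h0)
        x0, h0 = x1, h1
        x1, h1 = x2, h(x2)
    return x1


# Quadratic a(x) = alpha*x^2 + beta*x + gamma, so a'(x) = 2*alpha*x + beta.
alpha, beta, lam, y_hat = 0.5, 2.0, 1.5, 4.0
x_secant = solve_coordinate(y_hat, lam, lambda x: 2.0 * alpha * x + beta, 0.0, 5.0)
x_closed = (y_hat - lam * beta) / (1.0 + 2.0 * lam * alpha)

# Hypothetical nonlinear case a(x) = exp(x), whose derivative is exp(x).
x_exp = solve_coordinate(2.0, 1.0, math.exp, 0.0, 3.0)
residual_exp = x_exp + 1.0 * math.exp(x_exp) - 2.0
```

In the quadratic case the equation is linear in $x$, so a single secant step lands on the closed-form value $(\hat{y}_j - \lambda\beta)/(1 + 2\lambda\alpha)$.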
We now turn to finding $\lambda^*$ with $g'(\lambda^*) = 0$, knowing that

$$g'(\theta^-) \ge 0 \quad \text{and} \quad g'(\theta^+) \le 0,$$

where $g'$ is given by (19.17). The sets $J^-(\lambda)$, $J^+(\lambda)$ and $J(\lambda)$ remain fixed on this interval. Let

$$\beta = b - \sum_{j \in J^-} a_j(u_j) - \sum_{j \in J^+} a_j(\ell_j),$$

and

$$\gamma(\lambda) = \sum_{j \in J(\lambda)} a_j(x_j(\lambda)).$$

Note that from the definition of $\theta^-$ and $\theta^+$ it follows that

$$\eta_j^- \le \theta^- \le \theta^+ \le \eta_j^+ \quad \text{for all } j \in J.$$

Moreover, $\lambda \mapsto \gamma(\lambda)$ is a decreasing function with

$$\gamma(\theta^-) \ge \beta \quad \text{and} \quad \gamma(\theta^+) \le \beta.$$

We need to find $\lambda^*$ such that $\gamma(\lambda^*) = \beta$. Unless we have some expression for $a_j(x_j(\lambda))$ that can be handled easily, we again need to rely on a numerical procedure, and in this case too the secant method suggests itself [1]. That is what we have used in our own implementation of the procedure.

This projection method is extremely efficient in the linear case but also 
produces very good results in the nonlinear case, in which case its efficiency is 
that of the secant method used in finding \* and z; (A). 

If there is more than one constraint, in addition to the upper and lower 
bounds, it may still be possible to use the procedure outlined here. For example 
it is possible to keep track of the active constraints, and when only one (or no) 
extra constraint is violated then we could use this procedure to obtain the 
projection, provided the projected point does not violate some other constraint. 
We should thus be able to cope with two or three extra constraints, resorting 
only once in a while to a general optimization procedure for solving (19.5). 


Acknowledgment: We very much appreciate the comments of Dr. Richard 
Cottle (Stanford University) as well as his help on bibliographical questions. 
CHAPTER 20
DESCENT STOCHASTIC QUASIGRADIENT METHODS 
K. Marti 


20.1 Introduction 


The FORTRAN code “SEMI STOCHASTIC APPROXIMATION” can be applied in solving stochastic optimization problems of the following type:

$$\text{minimize } F(x) \quad \text{subject to } x \in D, \qquad (20.1)$$

where $D$ is a closed convex subset of $R^n$ and $F = F(x)$ is the convex mean value function defined by

$$F(x) = E\,u(A(\omega)x - b(\omega)), \quad x \in R^n. \qquad (20.1.1)$$

Here $(A(\omega), b(\omega))$ is an $m \times (n+1)$ random matrix and $u$ is a convex loss function on $R^m$ such that the mean value $F(x)$ in (20.1.1) is real for every $x \in R^n$. We suppose that the set $D^*$ of optimal solutions $x^*$ of (20.1) is nonempty.


Problems of the form (20.1) arise in many different connections, as e.g. 


— Stochastic linear programming with recourse [7], [23] 
— Portfolio optimization [9], [23] 

- Error minimization and optimal design [3], [20] 

— Statistical prediction [1] 

— Optimal decision functions [5], [10]. 


Since the gradient (or subgradient) $\partial F$ of $F$ exists under weak assumptions and is then given by the formula

$$\partial F(x) = E\,A(\omega)' \partial u(A(\omega)x - b(\omega)), \qquad (20.2)$$

where $A'$ is the transpose of a matrix $A$ and $\partial u$ denotes the subgradient of $u$, our basic problem (20.1) could be attacked in principle by a gradient (or quasigradient) procedure of the type

$$x_{k+1} = P_D(x_k - \rho_k g_k), \quad k = 1, 2, \ldots, \qquad (20.3)$$

where $\rho_k > 0$ is a step size, $g_k \in \partial F(x_k)$ and $P_D$ denotes the projection of $R^n$ onto $D$.

However, in practice the computation of the gradient (subgradient) $\partial F(x_k)$ causes in general the following difficulties:

- Either formula (20.2) cannot be evaluated at all because only a stochastic estimate $Y_k$ of an element $g_k \in \partial F(x_k)$ is available [3], [21]. Hence, in this case we only have

$$Y_k = g_k + \text{noise with some } g_k \in \partial F(x_k); \qquad (20.4.1)$$

- Or, though the integrand $A' \partial u(Ax - b)$ and the probability distribution $P_{(A(\cdot), b(\cdot))}$ of $(A(\omega), b(\omega))$ in (20.2) are known, the numerical evaluation of this formula (20.2), involving a multiple integral, is impossible in practice, since it takes too much computing time. In this case $\partial F(x_k)$ may be approximated by

$$Y_k \in A(\omega_k)' \partial u(A(\omega_k) x_k - b(\omega_k)), \qquad (20.4.2)$$

where $(A(\omega_k), b(\omega_k))$ is a realization of the random matrix $(A(\omega), b(\omega))$ generated independently of $x_k$ by means of a pseudo random generator [11].

Consequently, in both cases (20.4.1) and (20.4.2) the gradient procedure cannot be applied in practice.
It is therefore often replaced by the stochastic quasigradient method [8], [6]

$$X_{k+1} = P_D(X_k - \rho_k Y_k), \quad k = 1, 2, \ldots, \qquad (20.5)$$

where the random direction is defined now as described by (20.4.1) or (20.4.2). Selecting a priori a sequence of step sizes $\rho_1, \rho_2, \ldots$ such that

$$\rho_k > 0, \quad \sum_{k=1}^{\infty} \rho_k = +\infty, \quad \sum_{k=1}^{\infty} \rho_k^2 < +\infty,$$

e.g. $\rho_k = \frac{c}{k + q}$ for some constants $c > 0$ and $q \in N \cup \{0\}$, it is well known [19], [21] that the sequence of random iterates $X_1, X_2, \ldots$ generated by (20.5) converges with probability one to the set $D^*$ of optimal solutions $x^*$ of (20.1), provided that the approximates $Y_k$ of $\partial F(x_k)$ fulfill a certain uniform second order integrability condition and $D^*$ is a bounded set.

Unfortunately, due to their probabilistic nature, stochastic approximation procedures only have a very slow asymptotic rate of convergence of the type

$$E\|X_k - x^*\|^2 = O(k^{-\lambda}),$$

where $\lambda$ is a constant such that $0 < \lambda \le 1$.

Moreover, the main disadvantage of stochastic quasigradient procedures (20.5) is their nonmonotonicity, which sometimes shows up as highly oscillatory behavior [4]. Hence, in many cases one does not know whether the algorithm has already reached a certain neighborhood of an optimal solution $x^*$ or not.

For improving the convergence behavior of (20.5), several methods were suggested, e.g. based on the adaptive selection of the step sizes $\rho_k$, see [8], or based on the use of second order information about $F$, see [18].

A further method, having a partial monotonicity property, is presented in the following.


20.2 Semi-Stochastic approximation

As was shown in several papers [10], [12], [14], [15], [17], for several classes $U$ of convex loss functions $u$ and several classes $\Pi$ of distributions $P_{(A(\cdot), b(\cdot))}$ of the random matrix $(A(\omega), b(\omega))$, our minimization problem (20.1) has the following important


Property: (20.6)

At certain “nonefficient” or “nonstationary” points $x \in D$ there exists a deterministic (feasible) descent direction $h = h(x)$ of $F$ which can be computed with less computing expense than an element $g_k$ of $\partial F(x_k)$. Moreover, $h(x)$ is stable with respect to variations of the loss function $u \in U$.

Consequently, if at a certain iteration point $X_k$ this property (20.6) holds, then clearly one will replace the stochastic direction $-Y_k$, which is a descent direction only in the mean, by the descent direction $h_k = h(X_k)$ of $F$ available then at $X_k$ with low computing expense.

Hence, we obtain, as already described in [11], [18], the following


Descent Stochastic Quasigradient Method

$$X_{k+1} = \begin{cases} P_D(X_k + \rho_k h_k), & \text{if (20.6) holds at } X_k, \\ P_D(X_k - \rho_k Y_k), & \text{else.} \end{cases} \qquad (20.7.1)$$

In many important applications this hybrid procedure has the important feature that property (20.6) is fulfilled, e.g., at every second iteration point $X_k$. Hence, in this case (20.7.1) has the more convenient form

$$X_{k+1} = \begin{cases} P_D(X_k + \rho_k h_k), & \text{if } k \in N_1, \\ P_D(X_k - \rho_k Y_k), & \text{if } k \in N_2, \end{cases} \qquad (20.7.2)$$

where $N_1, N_2$ is a known decomposition of the set of integers $N$, e.g. $N_1 = \{1, 3, 5, \ldots\}$, $N_2 = \{2, 4, 6, \ldots\}$. As was shown in [18], if the step sizes $\rho_1, \rho_2, \ldots$ are chosen such that

$$\rho_k > 0, \quad \sum_{k \in N_2} \rho_k = +\infty, \quad \sum_{k=1}^{\infty} \rho_k^2 < +\infty,$$

then the semi-stochastic approximation procedure (20.7) converges with probability one to the set $D^*$ of optimal solutions $x^*$ of (20.1).
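The alternating scheme (20.7.2) can be sketched in a few lines. The sketch assumes that a deterministic descent direction is available at every odd iteration and uses the harmonic step size $\rho_k = c/(k+q)$; the function name `semi_stochastic`, the callables passed to it and the toy problem $F(x) = \frac{1}{2}\|x\|^2$ are illustrative assumptions, not part of the FORTRAN code discussed in this chapter.

```python
import random


def semi_stochastic(x0, descent_direction, quasigradient, project, n_iter,
                    c=1.0, q=0):
    """Alternate deterministic steps (odd k) and stochastic steps (even k)."""
    x = list(x0)
    for k in range(1, n_iter + 1):
        rho = c / (k + q)                       # assumed harmonic step size
        if k % 2 == 1:                          # k in N_1: move along h_k
            h = descent_direction(x)
            x = project([xi + rho * hi for xi, hi in zip(x, h)])
        else:                                   # k in N_2: move along -Y_k
            y = quasigradient(x)
            x = project([xi - rho * yi for xi, yi in zip(x, y)])
    return x


def box(z):
    return [min(max(v, -10.0), 10.0) for v in z]


# Toy problem F(x) = 0.5*||x||^2: h(x) = -x is a descent direction,
# and Y = x + noise is a stochastic quasigradient.
rng = random.Random(7)
noisy = semi_stochastic([4.0, -3.0], lambda x: [-v for v in x],
                        lambda x: [v + rng.gauss(0.0, 1.0) for v in x],
                        box, n_iter=50)
# Without noise the first deterministic step (rho = 1) already reaches the minimizer.
exact = semi_stochastic([4.0, -3.0], lambda x: [-v for v in x],
                        lambda x: list(x), box, n_iter=10)
```

In a real application the test "does (20.6) hold at $X_k$" would replace the parity check, falling back to the stochastic step whenever no deterministic direction can be constructed.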
As expected, several numerical examples [11] show that the descent stochastic quasigradient method (20.7) has a much better convergence behavior than the pure stochastic quasigradient method. In particular, the highly oscillatory behavior of the random iterates $X_k$, $k = 1, 2, \ldots$, observed in (20.5) is damped very much by also using deterministic descent directions $h_k$ in (20.7); moreover, the set $D^*$ of optimal solutions is reached more exactly. In a recent paper [16] the rate of convergence of (20.7) could be estimated as follows.


Theorem 2.1. Denote by $b_k = E\|X_k - x^*\|^2$ and $b_k^s = E\|X_k^s - x^*\|^2$ the mean square error of the descent stochastic quasigradient and of the pure stochastic quasigradient method, respectively.


(a) If a fixed rate of stochastic and deterministic steps is taken in (20.7), then there are constants $Q_1, Q_2$ with $0 < Q_1 < 1$, $Q_1 \le Q_2$ such that

$$Q_1 \cdot b_k^s \le b_k \le Q_2 \cdot b_k^s \quad \text{as } k \to \infty. \qquad (20.8)$$

Furthermore, $Q_1, Q_2$ are given by known formulas and $Q_2 < 1$ holds if $\frac{N}{M} < \gamma$, where $N$, $M$ is the number of stochastic, deterministic steps, respectively, in one complete turn of iterations and $\gamma$ is a certain ratio depending on the parameters of the problem (20.1).

(b) If the stochastic steps in (20.7) are taken at a decreasing rate, then the speed of convergence is increased from $b_k^s = O(\frac{1}{k})$ in the pure stochastic case to $b_k = O(k^{-\lambda})$ with a constant $1 < \lambda < 2$ in the semi-stochastic case.


20.8 Construction of deterministic descent direction 


Up to now deterministic feasible descent directions can be constructed if the
distributions $P_{(A(\cdot), b(\cdot))}$ are


• stable [13]
• invariant [15]
• discrete [14].


The following implementation is based on the assumption that $(A(\omega), b(\omega))$ has
an $m(n+1)$-dimensional normal distribution (20.9) with

mean $(\bar{A}, \bar{b})$   (20.9.1)

and covariance matrix

$$Q = \begin{pmatrix} Q_{11} & Q_{12} & \cdots & Q_{1m} \\ \vdots & \vdots & & \vdots \\ Q_{m1} & Q_{m2} & \cdots & Q_{mm} \end{pmatrix}, \qquad (20.9.2)$$

where the $(n+1) \times (n+1)$ matrix $Q_{ij}$ denotes the covariance matrix of the
$i$-th and $j$-th row $(A_i(\omega), b_i(\omega))$, $(A_j(\omega), b_j(\omega))$, resp., of the random matrix
$(A(\omega), b(\omega))$.
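Besides (20.9) we also suppose:

The objective function $F$ of (20.1) is not constant on arbitrary line segments $\overline{xy}$ of $\mathbb{R}^n$.   (20.10)

From assumption (20.9) it follows that the random $m$-vector $A(\omega)x - b(\omega)$ has a
normal distribution with mean $\bar{A}x - \bar{b}$ and covariance matrix

$$Q_x = \begin{pmatrix} \hat{x}'Q_{11}\hat{x} & \hat{x}'Q_{12}\hat{x} & \cdots & \hat{x}'Q_{1m}\hat{x} \\ \hat{x}'Q_{21}\hat{x} & \hat{x}'Q_{22}\hat{x} & \cdots & \hat{x}'Q_{2m}\hat{x} \\ \vdots & \vdots & & \vdots \\ \hat{x}'Q_{m1}\hat{x} & \hat{x}'Q_{m2}\hat{x} & \cdots & \hat{x}'Q_{mm}\hat{x} \end{pmatrix},$$

where $\hat{x} = \binom{x}{-1}$.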
The key to the construction of descent directions is the following theorem.
Theorem 3.1. Suppose that assumptions (20.9) and (20.10) are fulfilled. If
$n$-vectors $x$, $y \ne x$ are related according to the relations
$\bar{A}x = \bar{A}y$,   (20.11.1)
$Q_x - Q_y$ is positive semidefinite,   (20.11.2)
then $F(y) < F(x)$ and $h = y - x$ is a descent direction of $F$ at $x$. Moreover, if
$x \in D$ and in addition to (20.11.1) and (20.11.2) we also have
$y \in D$,   (20.11.3)
then $h = y - x$ is a feasible descent direction of $F$ at $x$.
Note. For given $x$, (20.11.1) is a system of $m$ linear equations for $y$. Relation
(20.11.2) means that the smallest eigenvalue of $Q_x - Q_y$ is nonnegative. In
the important special case $m = 1$, (20.11.2) reduces to the single quadratic
constraint
$\hat{x}'Q_{11}\hat{x} \ge \hat{y}'Q_{11}\hat{y}$.   (20.11.2a)

If $(A(\omega), b(\omega))$ has stochastically independent rows, then (20.11.2) is equivalent
to
$\hat{x}'Q_{ii}\hat{x} \ge \hat{y}'Q_{ii}\hat{y}$ for all $i = 1,2,\ldots,m$.   (20.11.2b)
In this case solutions $y$ of (20.11) may be obtained by solving, for a given vector
$x$, the convex program

minimize   $\hat{y}'Q_{i_0 i_0}\hat{y}$
subject to $\hat{y}'Q_{ii}\hat{y} \le \hat{x}'Q_{ii}\hat{x}$, $i = 1,2,\ldots,m$,
           $\bar{A}y = \bar{A}x$,   (20.12)
           $y \in D$,

where $1 \le i_0 \le m$ is a fixed integer.
In the general case one has to consider the program

minimize   $\lambda(Q_x - Q_y)$
subject to $\bar{A}y = \bar{A}x$,   (20.13)
           $y \in D$,

where $\lambda(Q_x - Q_y)$ denotes the smallest eigenvalue of $Q_x - Q_y$.
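
For the case of stochastically independent rows, the conditions of Theorem 3.1 reduce to scalar comparisons that are easy to test numerically. The following is only a minimal sketch under that assumption; the function names, the list-of-rows matrix layout and the tolerance `tol` are illustrative choices and not part of the FORTRAN implementation described in this chapter.

```python
def quad_form(Q, v):
    """Return v' Q v for a square matrix Q given as a list of rows."""
    size = len(v)
    return sum(v[r] * Q[r][c] * v[c] for r in range(size) for c in range(size))


def is_descent_candidate(A_bar, Q_blocks, x, y, tol=1e-9):
    """Check (20.11.1) and (20.11.2b) for a pair x, y (independent rows assumed).

    A_bar    : m x n mean matrix, as a list of rows
    Q_blocks : the m diagonal blocks Q_ii, each (n+1) x (n+1)
    """
    x_hat = list(x) + [-1.0]   # x extended by the component -1
    y_hat = list(y) + [-1.0]
    # (20.11.1): A_bar x = A_bar y, checked row by row
    for row in A_bar:
        ax = sum(a * xi for a, xi in zip(row, x))
        ay = sum(a * yi for a, yi in zip(row, y))
        if abs(ax - ay) > tol:
            return False
    # (20.11.2b): x_hat' Q_ii x_hat >= y_hat' Q_ii y_hat for every row i
    return all(quad_form(Q, x_hat) + tol >= quad_form(Q, y_hat) for Q in Q_blocks)
```

With $m = 1$, $\bar{A} = (1, 1)$ and $Q_{11}$ the identity matrix, the pair $x = (2, 0)$, $y = (1, 1)$ passes both checks, since $\bar{A}x = \bar{A}y = 2$ and $\hat{x}'Q_{11}\hat{x} = 5 \ge 3 = \hat{y}'Q_{11}\hat{y}$.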
20.4 Implementation


20.4.1 Representation of the random matrix $(A(\omega), b(\omega))$
$(A(\omega), b(\omega))$ is defined by

$$(A(\omega), b(\omega)) = (A^0, b^0) + \sum_{j=1}^{r} \omega^j (A^j, b^j), \qquad (20.14)$$

where $(A^j, b^j)$, $j = 0,1,\ldots,r$, are $m \times (n+1)$ matrices to be selected by the
user and $\omega^1, \omega^2, \ldots, \omega^r$ are independent normal random variables with mean
zero and variance one. A realization $(A(\omega_k), b(\omega_k))$ of $(A(\omega), b(\omega))$ is then
represented by

$$(A(\omega_k), b(\omega_k)) = (A^0, b^0) + \sum_{j=1}^{r} \omega_k^j (A^j, b^j),$$

where $\omega_k = (\omega_k^1, \omega_k^2, \ldots, \omega_k^r)$, $k = 0,1,\ldots$, is a sequence of stochastically
independent realizations of the random $r$-vector $\omega = (\omega^1, \omega^2, \ldots, \omega^r)$ generated by
means of a pseudo random generator (converting uniformly distributed pseudo
random numbers into normally distributed ones based on the central limit
theorem).
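
The representation (20.14) can be sampled as sketched below. The text only states that the normal variates are derived from uniform ones via the central limit theorem; summing twelve uniforms and subtracting six is one classical way to do this and is an assumption here, as are the function names and the list-of-rows layout of the matrices.

```python
import random


def approx_standard_normal(rng):
    """Sum of 12 uniform(0,1) draws minus 6: mean 0, variance 1 (CLT approximation)."""
    return sum(rng.random() for _ in range(12)) - 6.0


def sample_realization(mats, rng):
    """Draw one realization (A(w_k), b(w_k)) according to (20.14).

    mats[0] holds (A^0, b^0) and mats[j] holds (A^j, b^j), j = 1..r,
    each stored as an m x (n+1) list of rows.
    """
    r = len(mats) - 1
    omega = [approx_standard_normal(rng) for _ in range(r)]
    rows, cols = len(mats[0]), len(mats[0][0])
    return [
        [mats[0][i][c] + sum(omega[j] * mats[j + 1][i][c] for j in range(r))
         for c in range(cols)]
        for i in range(rows)
    ]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.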


20.4.2 Computation of the search directions 


We suppose that
rank $\bar{A}$ = rank $A^0$ = $m < n$.


The matrix $\bar{A} = (\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_n)$, $\bar{a}_k$ = $k$-th column of $\bar{A}$, must be partitioned
by the user into a regular $m \times m$ matrix


$B = (\bar{a}_{k_1}, \bar{a}_{k_2}, \ldots, \bar{a}_{k_m})$
and an $m \times (n - m)$ rest matrix
$E = (\bar{a}_{k_{m+1}}, \bar{a}_{k_{m+2}}, \ldots, \bar{a}_{k_n})$.
The user then has to define the index set
INDXAO = $\{k_1, k_2, \ldots, k_m, k_{m+1}, \ldots, k_n\}$.

Given the last iteration point $x_k$, in the subroutine FUNCT a solution $y_k$ of
the relations (20.11.1) - (20.11.3) is computed.

At present only the case $D = \mathbb{R}^n$ is implemented. For the sake of generality the
system of relations (20.11) is solved by means of the program (20.13). However,
in a more special situation, the user only has to replace the procedure
(20.13) presently implemented in FUNCT by his own procedure for solving
(20.11).
If $y_k \ne x_k$, then $h_k = y_k - x_k$ is a feasible descent direction (see Theorem
3.1) and the next iteration point $x_{k+1}$ is defined by

$$x_{k+1} = x_k + \rho_k (y_k - x_k),$$

where $\rho_k > 0$ is a step size.
If $y_k = x_k$, then FUNCT fails to find a descent direction. Hence, the next
iteration point is defined by

$$x_{k+1} = x_k - \rho_k Y_k,$$

where

$$Y_k \in A(\omega_k)' \partial u(A(\omega_k) x_k - b(\omega_k)).$$


20.4.3 Step size
At present the step sizes $\rho_k$, $k = 0,1,\ldots$, are defined by

$$\rho_k = \frac{1}{k+1}.$$

For a deterministic step the user may also take $\rho_k = 1$ or $\rho_k = 0.5$.


20.4.4 Loss function $u$
The following classes of loss functions are implemented:


(a) Quadratic loss functions
$$u(z) = c + q'z + z'Wz, \quad z \in \mathbb{R}^m,$$
where $c$ is a fixed number, $q$ denotes an $m$-vector and $W$ is a positive
semidefinite $m \times m$ matrix.
(b) Polynomial loss function
$$u(z) = \sum_{j=1}^{m} z_j^t, \quad z = (z_1, \ldots, z_m)' \in \mathbb{R}^m,$$
where $t$ is a fixed integer.
(c) Sublinear loss function
$$u(z) = \max_{1 \le i \le p} f_i'z, \quad z \in \mathbb{R}^m,$$
where $f_1, f_2, \ldots, f_p$ are fixed $m$-vectors.
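
As an illustration of how a loss function enters the stochastic step $Y_k \in A(\omega_k)'\partial u(A(\omega_k)x_k - b(\omega_k))$, the sketch below evaluates the quadratic and the sublinear loss together with one (sub)gradient each. It is a hedged example only: the function names and the list-based data layout are assumptions, and the polynomial class is left out.

```python
def quadratic_loss(z, c, q, W):
    """u(z) = c + q'z + z'Wz."""
    m = len(z)
    linear = sum(q[i] * z[i] for i in range(m))
    quad = sum(z[i] * W[i][j] * z[j] for i in range(m) for j in range(m))
    return c + linear + quad


def quadratic_grad(z, q, W):
    """Gradient of the quadratic loss: q + (W + W')z."""
    m = len(z)
    return [q[i] + sum((W[i][j] + W[j][i]) * z[j] for j in range(m)) for i in range(m)]


def sublinear_loss(z, F):
    """u(z) = max_i f_i'z, where F is the list of the vectors f_1..f_p."""
    return max(sum(fi * zi for fi, zi in zip(f, z)) for f in F)


def sublinear_subgrad(z, F):
    """One subgradient of the sublinear loss: a vector f_i attaining the maximum."""
    return max(F, key=lambda f: sum(fi * zi for fi, zi in zip(f, z)))


def stochastic_subgradient(Ab, x, subgrad_u):
    """Y = A'g with g in du(Ax - b); Ab is an m x (n+1) realization (A, b)."""
    n = len(x)
    z = [sum(row[c] * x[c] for c in range(n)) - row[n] for row in Ab]
    g = subgrad_u(z)
    return [sum(Ab[i][c] * g[i] for i in range(len(Ab))) for c in range(n)]
```

A call such as `stochastic_subgradient(Ab, x, lambda z: sublinear_subgrad(z, F))` returns the vector used in the stochastic step for the sublinear class.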
20.4.5 Stopping criterion


The user has to select a (small) positive number EPS > 0, an integer ITMAX and
a number TMAX. The procedure runs until the first of the following conditions
is fulfilled:


$\|x_{k+1} - x_k\| <$ EPS,

$k >$ ITMAX (= maximal number of iterations),

$T >$ TMAX (= maximal computing time),
where $\|\cdot\|$ denotes the Euclidean norm.
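
Putting the pieces of Section 20.4 together, the overall iteration with its three stopping rules can be sketched as follows. This is not the FORTRAN code: `find_y` is a hypothetical stand-in for the subroutine FUNCT, `stoch_subgrad` is a hypothetical callback returning $Y_k$, and the step size $1/(k+1)$ is an assumed harmonic rule.

```python
import math
import time


def semi_stochastic(x0, find_y, stoch_subgrad, eps, itmax, tmax):
    """Semi-stochastic iteration for D = R^n; returns (last iterate, iteration count)."""
    x = list(x0)
    start = time.monotonic()
    k = 0
    while True:
        rho = 1.0 / (k + 1)                 # assumed step-size rule
        y = find_y(x)                       # plays the role of FUNCT
        if y != x:
            # deterministic descent step along h = y - x
            x_new = [xi + rho * (yi - xi) for xi, yi in zip(x, y)]
        else:
            # FUNCT found no descent direction: take a stochastic step
            g = stoch_subgrad(x)
            x_new = [xi - rho * gi for xi, gi in zip(x, g)]
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_new, x)))
        x, k = x_new, k + 1
        if step < eps or k > itmax or time.monotonic() - start > tmax:
            return x, k
```

The three tests in the last `if` correspond to EPS, ITMAX and TMAX; whichever is met first ends the run.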


Acknowledgment: The FORTRAN code was written by A. Bohme. 
CHAPTER 21


STOCHASTIC INTEGER PROGRAMMING BY 
DYNAMIC PROGRAMMING 


B.J. Lageweg, J.K. Lenstra, A.R. Kan and L. Stougie 


Abstract 


Stochastic integer programming is a suitable tool for modeling hierarchical de- 
cision situations with combinatorial features. In continuation of our work on 
the design and analysis of heuristics for such problems, we now try to find op- 
timal solutions. Dynamic programming techniques can be used to exploit the 
structure of two-stage scheduling, bin packing and multiknapsack problems. 
Numerical results for small instances of these problems are presented. 


21.1 Introduction 


Stochastic integer programming problems appear to be among the hardest problems
in the area of mathematical programming. Most research on these problems
has so far concentrated on the design and analysis of approximation
algorithms. A survey of recent work in this direction, illustrated on the probabilistic
analysis of a two-stage scheduling heuristic, can be found elsewhere in this
volume [9].

In this chapter, we are interested in optimization algorithms for stochastic 
integer programming. The development of a reasonably efficient general pro- 
cedure for this purpose seems a tremendous research challenge. Our objective 
is more modest. We will consider stochastic integer programs of a very special 
structure. The stochastic parameters will have a discrete distribution with a 
finite number of points with positive density. Moreover, each realization will 
lead to a combinatorial optimization problem that is solvable by a dynamic 
programming routine. The overall stochastic optimization problem will then 
be solved by a single giant recursion that combines the separate dynamic pro- 
gramming computations for all the individual realizations. This can be done 
only for problem instances of a relatively small size. Still, our numerical re- 
sults give valuable insight into the shape of value functions of stochastic integer 
programming problems. 

The following three sections illustrate our approach on two-stage scheduling,
bin packing, and multiknapsack problems. In each section, we first formulate
the problem in question, then present the dynamic programming algorithm, and 
finally discuss our numerical results. We note that the computational experi- 
ence was obtained by improved implementations of the basic recursions, the 
technical details of which can be found in an extended version of this paper [7]. 
21.2 Scheduling


21.2.1 Problem Formulation 


The two-stage scheduling problem studied in this section was first formulated
in [2]. At the aggregate level, one has to decide on the number $X$ of identical
parallel machines that are to be acquired, while knowing the cost $c$ of a single
machine, the number $n$ of jobs that are to be processed, and the probability
distribution of the vector $\boldsymbol{\omega} = (\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_n)$ of their processing times. It is assumed
that the $\boldsymbol{\omega}_j$ are independent and identically distributed random variables with
expectation $\mu$. At the detailed level, after $X$ has been determined, a realization
$\omega \in \Omega$ of $\boldsymbol{\omega}$ becomes known, where $\Omega$ denotes the set of all realizations, and
one has to decide on a schedule in which each machine processes at most one
job at a time, job $j$ is processed during an uninterrupted time period of length
$\omega_j$ $(j = 1, \ldots, n)$ and no job is processed prior to time 0, so as to achieve a
minimum value $Y^*(X, \omega)$ of the maximum job completion time. The total cost
of the acquisition decision $X$ and the optimal scheduling decision is denoted by
$V^*(X, \omega) = cX + Y^*(X, \omega)$.

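In the two-stage decision model, the objective is to determine a value $X^* \in \mathbb{N}$
such that the expected total cost is minimized:

$$EV^*(X^*, \boldsymbol{\omega}) = \min_{X \in \mathbb{N}} \{EV^*(X, \boldsymbol{\omega})\}.$$

In the distribution model, the objective is to determine a function $X^0 : \Omega \to \mathbb{N}$
such that for each $\omega \in \Omega$ the actual total cost is minimized:

$$V^*(X^0(\omega), \omega) = \min_{X \in \mathbb{N}} \{V^*(X, \omega)\}, \quad \forall \omega \in \Omega.$$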


Previous work on this problem concerned the design and analysis of a two- 
stage heuristic [8]. This heuristic sets the number of machines equal to the value 
of $X$ that minimizes the lower bound $V^{LB}(X) = cX + n\mu/X$ on $EV^*(X, \boldsymbol{\omega})$ and
assigns the jobs to the machines by a list scheduling rule. (In our computational 
experiments, we used the longest processing time rule, which puts the jobs on a 
list in order of nonincreasing processing times and successively assigns the next 
job on the list to the earliest available machine; this rule has a better worst 
case performance than arbitrary list scheduling [5].) The relative error of the 
heuristic tends to 0 as n tends to infinity for various measures of stochastic 
convergence [8]. 
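
The two-stage heuristic just described can be sketched in a few lines. The code below is an illustration under stated assumptions: the function names are hypothetical, the number of machines is restricted to integers $X \ge 1$, and ties in the lower bound are broken in favor of the smaller $X$.

```python
import heapq
import math


def best_machine_count(c, n, mu):
    """Integer X >= 1 minimizing the lower bound c*X + n*mu/X.

    The bound is convex in X > 0, so only the two integers around the
    continuous minimizer sqrt(n*mu/c) need to be compared.
    """
    x_cont = math.sqrt(n * mu / c)
    candidates = {max(1, math.floor(x_cont)), max(1, math.ceil(x_cont))}
    return min(candidates, key=lambda X: (c * X + n * mu / X, X))


def lpt_makespan(times, X):
    """Longest-processing-time list scheduling on X identical machines."""
    loads = [0.0] * X
    heapq.heapify(loads)
    for t in sorted(times, reverse=True):
        earliest = heapq.heappop(loads)     # earliest available machine
        heapq.heappush(loads, earliest + t)
    return max(loads)
```

The heuristic's total cost for a realization `times` is then `c * X + lpt_makespan(times, X)` with `X = best_machine_count(c, n, mu)`.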
21.2.2 Dynamic programming


The second stage scheduling problem of determining $Y^*(X, \omega)$ for given $X$ and
$\omega$ is NP-hard [4]. We will consider the situation in which the processing times
can assume only $k$ distinct values $a_1, \ldots, a_k$, for a fixed value of $k$. Let us
denote by $\omega = [n_1, \ldots, n_k]$ the vector of processing times in which the value $a_j$
occurs $n_j$ times, for $j = 1, \ldots, k$.

One can obtain an optimal schedule on X machines by assigning a certain 
subset of jobs optimally to X — 1 machines and putting the remaining jobs on 
another machine. This observation leads to the following recurrence relations: 


$$Y^*(X, [n_1, \ldots, n_k]) = \min\{\max\{Y^*(X - 1, [n_1 - x_1, \ldots, n_k - x_k]), \ Y^*(1, [x_1, \ldots, x_k])\} \mid 0 \le x_j \le n_j \ (j = 1, \ldots, k)\} \quad (X > 1),$$

$$Y^*(1, [n_1, \ldots, n_k]) = \sum_{j=1}^{k} n_j a_j.$$


Computation of $Y^*(X, \omega)$ by a dynamic programming algorithm based on this
recursion requires $O(X \prod_{j=1}^{k} n_j^2)$ time, which is exponential in $k$ but polynomial
for fixed $k$.

In the more general context of the two-stage scheduling problem, we assume
that the processing times have a discrete distribution with $k$ integral values
$a_1, \ldots, a_k$ in its support. The independence of the processing times implies
that $\boldsymbol{\omega} = [n_1, \ldots, n_k]$ has a multinomial distribution. The idea is now to go
through the entire recursion once in order to compute $Y^*(X, \omega)$ for all values
$X \in \{1, \ldots, n\}$ and for all realizations $\omega \in \Omega$, where $\Omega$ is given by

$$\Omega = \{[n_1, \ldots, n_k] \mid 0 \le n_j \le n \ (j = 1, \ldots, k), \ n_1 + \cdots + n_k = n\}.$$

The distribution model is then solved by the selection, for each $\omega \in \Omega$, of a value
of $X$ that minimizes $V^*(X, \omega) = cX + Y^*(X, \omega)$. The two-stage decision model
is solved by the determination of a value of $X$ that minimizes
$EV^*(X, \boldsymbol{\omega}) = cX + \sum_{\omega \in \Omega} \Pr\{\boldsymbol{\omega} = \omega\} Y^*(X, \omega)$.

A straightforward application of the above dynamic programming algorithm
requires $O(n^k)$ comparisons for each of the $O(n^{k+1})$ pairs $(X, \omega)$, and
hence $O(n^{2k+1})$ time altogether; the multinomial probabilities are easily
computed within this time bound. A more efficient implementation reduces the
overall running time to $O(n^{2k})$ [7].
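
A direct, unoptimized rendering of the recursion and of the expected cost may help to fix ideas. It corresponds to the straightforward application only, not to the improved implementation of [7]; the function names and the use of memoization via `functools.lru_cache` are choices made for this sketch.

```python
from functools import lru_cache
from itertools import product
from math import factorial


def make_makespan_dp(a):
    """Return Y(X, counts): the minimum makespan on X machines when the
    processing time a[j] occurs counts[j] times (counts must be a tuple)."""
    @lru_cache(maxsize=None)
    def Y(X, counts):
        if X == 1:
            return sum(nj * aj for nj, aj in zip(counts, a))
        best = None
        # x[j] = number of jobs with processing time a[j] put on one machine
        for x in product(*(range(nj + 1) for nj in counts)):
            rest = tuple(nj - xj for nj, xj in zip(counts, x))
            value = max(Y(X - 1, rest), Y(1, x))
            if best is None or value < best:
                best = value
        return best
    return Y


def compositions(n, k):
    """All vectors (n_1, ..., n_k) of nonnegative integers with sum n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for tail in compositions(n - first, k - 1):
            yield (first,) + tail


def multinomial_prob(counts, p):
    """Pr{w = counts} when each job takes value a[j] with probability p[j]."""
    coef = factorial(sum(counts))
    prob = 1.0
    for nj, pj in zip(counts, p):
        coef //= factorial(nj)
        prob *= pj ** nj
    return coef * prob


def expected_total_cost(X, n, a, p, c):
    """E V*(X, w) = c*X + sum over realizations of Pr{w} * Y*(X, w)."""
    Y = make_makespan_dp(tuple(a))
    return c * X + sum(multinomial_prob(w, p) * Y(X, w)
                       for w in compositions(n, len(a)))
```

Minimizing `expected_total_cost(X, n, a, p, c)` over `X = 1, ..., n` gives the two-stage decision; minimizing `c * X + Y(X, w)` separately for each `w` gives the distribution model.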
21.2.3 Computational results


The dynamic programming algorithm was coded in PASCAL and run on a CD 
Cyber 170-750 to solve several instances of the two-stage scheduling problem. 
The solution of instances with 100 jobs and two possible processing time values 
or with 50 jobs and three processing time values required about 30 seconds. The 
values of $k$ considered are admittedly small, but the values of $n$ are realistic
and the running times are such that our brute force approach should not be 
dismissed on grounds of manifest inefficiency. 

We illustrate the numerical results on a set of representative instances given
by

$c = 1$,
$n = 1, \ldots, 100$,
$k = 2$, $a_1 = 18$, $a_2 = 14$, $\Pr\{\boldsymbol{\omega}_j = a_1\} = \Pr\{\boldsymbol{\omega}_j = a_2\} = \frac{1}{2}$ $(j = 1, \ldots, n)$.


Figure 21.1 shows four functions of the number of jobs:

- the minimal lower bound $\min_X \{V^{LB}(X)\}$ mentioned in Section 21.2.1;

- the minimal expected total cost $EV^*(X^*, \boldsymbol{\omega})$ (the optimum for the two-stage
decision model);

- the expected minimal total cost $EV^*(X^0(\boldsymbol{\omega}), \boldsymbol{\omega})$ (the optimum for the
distribution model, averaged over all realizations);

- the expected approximate total cost obtained by the heuristic mentioned
in Section 21.2.1.


Note that the last three functions are defined only for integral $n$; linear
interpolation has been applied to improve the presentation. The distribution model
yields slightly better results than the two-stage decision model on average, as 
expected. A comparison between the optima and the lower and upper bounds 
confirms that the absolute differences are significant while the relative differ- 
ences disappear with increasing problem size. 

For the case that n = 100, Figure 21.2 shows three functions of the first 
stage decision variable, the number X of machines: 

- the lower bound $V^{LB}(X)$;

- the expected total cost $EV^*(X, \boldsymbol{\omega})$ in case of an optimal second stage
decision;

- the expected total cost in case of an approximate second stage decision.
Note that we have interpreted X as a continuous variable: acquisition of a 
fractional machine costs a fraction of c but yields no benefit at the second 
stage; the vertical line segments correspond to discontinuities. In spite of the 
smoothing effect due to averaging over all realizations, both the optimal and 
the approximate cost functions are highly nonconvex and multimodal. The 
functions consist of a first stage component, which is linear and increasing, and 
a second stage component, which is nonconvex and nonincreasing. Addition of 
the two components can turn the nonconvexities into local minima, and small 
values of $c$ appear to be most effective in this respect.
21.3 Bin Packing


21.3.1 Problem formulation


The two-stage bin packing problem is formulated as follows. At the aggregate
level, one has to decide on the capacity $Y$ of bins, while knowing the cost $d$
of one unit of capacity, the number $n$ of items that are to be packed into the
bins, and the probability distribution of the vector $\boldsymbol{\omega} = (\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_n)$ of the
item weights. It is again assumed that the $\boldsymbol{\omega}_j$ are independent and identically
distributed random variables with expectation $\mu$. At the detailed level, after
$Y$ has been determined, a realization $\omega \in \Omega$ of $\boldsymbol{\omega}$ becomes known, and one has
to decide on a packing in which each item is assigned to a bin and the total
weight of the items assigned to the same bin does not exceed its capacity $Y$,
so as to achieve a minimum number $X^*(Y, \omega)$ of bins needed. The total cost of
the first stage decision $Y$ and the optimal second stage decision is denoted by
$W^*(Y, \omega) = dY + X^*(Y, \omega)$.

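In the two-stage decision model, the objective is to determine a value $Y^* \in \mathbb{R}_+$
such that

$$EW^*(Y^*, \boldsymbol{\omega}) = \min_{Y \in \mathbb{R}_+} \{EW^*(Y, \boldsymbol{\omega})\}.$$

In the distribution model, the objective is to determine a function $Y^0 : \Omega \to \mathbb{R}_+$
such that

$$W^*(Y^0(\omega), \omega) = \min_{Y \in \mathbb{R}_+} \{W^*(Y, \omega)\}, \quad \forall \omega \in \Omega.$$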


This problem is the symmetric counterpart of the two-stage scheduling 
problem from the previous section. One can view items as jobs, weights as 
processing times, bins as machines and their capacity as a job completion
deadline, but now the order of the decisions is reversed. In fact, the above cost
structure is quite natural in this context. First, a delivery date for the jobs is 
negotiated, whereby the cost of extending this date by one unit is independent 
of the number of machines that will turn out to be needed later on. 

In analogy to the two-stage scheduling heuristic given at the end of Section 
21.2.1, one can consider the following two-stage bin packing heuristic. The 
bin capacity is set equal to the value of Y that minimizes the lower bound 
$W^{LB}(Y) = dY + n\mu/Y$ on $EW^*(Y, \boldsymbol{\omega})$, and the items are packed into bins by the
first fit decreasing rule, i.e., the items are taken in order of nonincreasing weights 
and each next item is assigned to the first bin that has enough capacity to 
accommodate it. This heuristic can be shown to have several strong properties 
of asymptotic optimality [10]. 
21.3.2 Dynamic programming


The second stage bin packing problem of determining $X^*(Y, \omega)$ for given $Y$ and
$\omega$ is NP-hard [4]. We will again consider the situation in which the stochastic
parameters can assume only $k$ values $a_1, \ldots, a_k$, for a fixed $k$, and write
$\omega = [n_1, \ldots, n_k]$ to denote the vector in which the value $a_j$ occurs $n_j$ times, for
$j = 1, \ldots, k$.

The following dynamic programming algorithm is due to [6]. Let $C(Y, \omega)$
be the total amount of capacity needed to pack items with weights specified by
$\omega$ into bins of capacity $Y$. It is assumed that $C(Y, \omega)$ includes the slack capacity
of each bin (which is equal to $Y$ minus the total weight of the items assigned to
that bin) except for the slack capacity of the last bin. Thus, if $C(Y, \omega) = XY - \Gamma$
with $X \in \mathbb{Z}_+$ and $0 \le \Gamma < Y$, then an optimal packing requires $X$ bins and the
last bin has a slack capacity of $\Gamma$. Let $\Delta(Y, \omega, a)$ be the extra capacity needed
when an item with weight $a$ is added to this packing:

$$\Delta(Y, \omega, a) = \begin{cases} a & \text{if } \Gamma \ge a, \\ a + \Gamma & \text{if } \Gamma < a. \end{cases}$$

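It is not hard to see that

$$C(Y, [n_1, \ldots, n_k]) = \min_{j:\, n_j > 0} \{C(Y, [n_1, \ldots, n_{j-1}, n_j - 1, n_{j+1}, \ldots, n_k]) + \Delta(Y, [n_1, \ldots, n_{j-1}, n_j - 1, n_{j+1}, \ldots, n_k], a_j)\} \quad (n_1 + \cdots + n_k > 0),$$

$$C(Y, [0, \ldots, 0]) = 0.$$

We finally have that $X^*(Y, \omega) = \lceil C(Y, \omega)/Y \rceil$.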
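
The capacity recursion translates almost literally into code. The sketch below assumes integral weights and an integral capacity, recovers the slack of the last bin from $C$ as $\Gamma = (-C) \bmod Y$, and uses hypothetical function names; it is meant as an illustration, not as the implementation used for the reported experiments.

```python
from functools import lru_cache


def min_bins(Y, a, counts):
    """X*(Y, w): minimum number of bins of capacity Y when weight a[j]
    occurs counts[j] times (all quantities integral)."""
    if any(nj > 0 and aj > Y for aj, nj in zip(a, counts)):
        raise ValueError("an item does not fit into an empty bin")

    @lru_cache(maxsize=None)
    def C(w):
        if sum(w) == 0:
            return 0
        best = None
        for j, nj in enumerate(w):
            if nj == 0:
                continue
            prev = w[:j] + (nj - 1,) + w[j + 1:]
            c_prev = C(prev)
            slack = (-c_prev) % Y       # slack capacity of the last bin
            # extra capacity: a[j] if the item fits, else wasted slack + a[j]
            delta = a[j] if slack >= a[j] else a[j] + slack
            if best is None or c_prev + delta < best:
                best = c_prev + delta
        return best

    return -(-C(tuple(counts)) // Y)    # integer ceiling of C / Y
```

The recursion depth equals the number of items, so very large instances would need an iterative formulation.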

For the two-stage bin packing problem, we make the same assumptions concerning
the distribution of the stochastic parameters as in Section 21.2.2 and
apply the same strategy to obtain solutions to both stochastic optimization
models. Since the values $a_1, \ldots, a_k$ are integral, there is no loss of generality in
considering only integral capacities $Y$. Let $a_{\max} = \max\{a_1, \ldots, a_k\}$ and note
that $1 \le Y \le n a_{\max}$. The algorithm requires a fixed number of comparisons
for each of the $O(n^{k+1} a_{\max})$ pairs $(Y, \omega)$, and hence $O(n^{k+1} a_{\max})$ time
altogether. A more efficient implementation reduces the overall running time to
$O(n^{k+(1/2)} a_{\max}^{1/2} d^{-(1/2)})$ [7].

Due to the relation between the two-stage scheduling and bin packing problems
that was observed above, the $Y^*(X, \omega)$ values from Section 21.2.2 could be
used to derive the $X^*(Y, \omega)$ values needed here and vice versa, as long as the set
$\{a_1, \ldots, a_k\}$ is the same in both cases. The former recursion has the advantage
of requiring strictly polynomial time; the latter one is pseudopolynomial but
much faster for small values $a_1, \ldots, a_k$.
21.3.3 Computational results
For the typical problem instance given by 


$d = 1$,
$n = 100$,
$k = 2$, $a_1 = 18$, $a_2 = 14$, $\Pr\{\boldsymbol{\omega}_j = a_1\} = \Pr\{\boldsymbol{\omega}_j = a_2\} = \frac{1}{2}$ $(j = 1, \ldots, n)$,


Figure 21.3 shows three functions of the first stage decision variable, the capac- 
ity Y: 

- the lower bound $W^{LB}(Y)$;

- the expected total cost $EW^*(Y, \boldsymbol{\omega})$ in case of an optimal second stage
decision;

- the expected total cost in case of an approximate second stage decision.
An investigation of these and other results leads to the same conclusions con- 
cerning running time, quality of lower and upper bounds, and the occurrence 
of multiple local minima as in Section 21.2.3. 


21.4 Multiknapsack 


21.4.1 Problem formulation


The two-stage multiknapsack problem that we will consider here can be viewed
as a capital budgeting problem. At the aggregate level, one has to decide on the
sizes $X_1, \ldots, X_m$ of $m$ budgets that are to be reserved for financing a number
of projects, while knowing the cost $c_i$ of reserving one unit of budget $i$
$(i = 1, \ldots, m)$, the requirement $r_{ij}$ of project $j$ out of budget $i$ $(i = 1, \ldots, m$,
$j = 1, \ldots, n)$, and the probability distribution of the vector $\boldsymbol{\omega} = (\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_n)$ of
revenues that the projects will yield. It is assumed that all $c_i$, $r_{ij}$ and $\boldsymbol{\omega}_j$ are
nonnegative and that the $r_{ij}$ are integral. At the detailed level, after
$X = (X_1, \ldots, X_m)$ has been determined, a realization $\omega \in \Omega$ of $\boldsymbol{\omega}$ becomes known,
and one has to decide on a selection $S$ of the projects that maximizes the total
revenue $Y^*(X, \omega)$ within the budget constraints:


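$$Y^*(X, \omega) = \max_{S \subseteq \{1, \ldots, n\}} \Big\{ \sum_{j \in S} \omega_j \;\Big|\; \sum_{j \in S} r_{ij} \le X_i \ (i = 1, \ldots, m) \Big\}.$$

The total profit of the budgeting decision $X$ and the optimal selection decision
is denoted by $Z^*(X, \omega) = -\sum_{i=1}^{m} c_i X_i + Y^*(X, \omega)$.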
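In the two-stage decision model, the objective is to determine a vector
$X^* \in \mathbb{R}_+^m$ such that

$$EZ^*(X^*, \boldsymbol{\omega}) = \max_{X \in \mathbb{R}_+^m} \{EZ^*(X, \boldsymbol{\omega})\}.$$

In the distribution model, the objective is to determine a function $X^0 : \Omega \to \mathbb{R}_+^m$
such that

$$Z^*(X^0(\omega), \omega) = \max_{X \in \mathbb{R}_+^m} \{Z^*(X, \omega)\}, \quad \forall \omega \in \Omega.$$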
21.4.2 The distribution model


The knapsack problem, i.e., the second stage problem with $m = 1$, is already
NP-hard [4]. Surprisingly, the distribution model is easily solved to optimality.
For each $\omega \in \Omega$, the selection $S(\omega)$ of profitable projects is given by
$S(\omega) = \{j \mid \omega_j - \sum_{i=1}^{m} c_i r_{ij} > 0\}$. The minimum budgets needed to finance these
projects are equal to $X_i^0(\omega) = \sum_{j \in S(\omega)} r_{ij}$ $(i = 1, \ldots, m)$, and the corresponding total
profit is

$$Z^*(X^0(\omega), \omega) = \sum_{j \in S(\omega)} \Big( \omega_j - \sum_{i=1}^{m} c_i r_{ij} \Big), \quad \forall \omega \in \Omega.$$

In the situation that each revenue $\boldsymbol{\omega}_j$ can assume only $k$ distinct values, the
determination of $X^0$ requires $O(mn)$ computations for each of $k^n$ realizations
$\omega$.
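
For a single realization, the distribution model amounts to keeping every project whose revenue exceeds its total budget cost. A minimal sketch, with hypothetical names and zero-based project indices:

```python
def distribution_model(revenues, r, c):
    """Solve the distribution model for one realization.

    revenues[j] : revenue w_j of project j
    r[i][j]     : requirement of project j out of budget i
    c[i]        : cost of one unit of budget i
    Returns (S, X0, profit): the profitable projects, the minimum budgets
    that finance them, and the resulting total profit.
    """
    m, n = len(c), len(revenues)
    unit_cost = [sum(c[i] * r[i][j] for i in range(m)) for j in range(n)]
    S = [j for j in range(n) if revenues[j] - unit_cost[j] > 0]
    X0 = [sum(r[i][j] for j in S) for i in range(m)]
    profit = sum(revenues[j] - unit_cost[j] for j in S)
    return S, X0, profit
```

The work per realization is $O(mn)$, in line with the count given above.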


21.4.3 Dynamic programming


The second stage multiknapsack problem is solvable by a classical dynamic
programming algorithm from [1]. Let $F_j(X, \omega)$ be the maximum revenue if
only the first $j$ projects can be selected, for given budgets $X = (X_1, \ldots, X_m)$
and revenues $\omega = (\omega_1, \ldots, \omega_n)$. An optimal selection is either restricted to the
first $j - 1$ projects or includes project $j$:

$$F_j((X_1, \ldots, X_m), \omega) = \max\{F_{j-1}((X_1, \ldots, X_m), \omega), \ F_{j-1}((X_1 - r_{1j}, \ldots, X_m - r_{mj}), \omega) + \omega_j\} \quad (j = 1, \ldots, n),$$

$$F_0((X_1, \ldots, X_m), \omega) = \begin{cases} 0 & \text{if } X_i \ge 0 \ (i = 1, \ldots, m), \\ -\infty & \text{otherwise.} \end{cases}$$


Since the requirements $r_{ij}$ are integral, the budgets $X_i$ can also be assumed to
be integral. Computation of $Y^*(X, \omega) = F_n(X, \omega)$ requires a single comparison
for each of $\prod_{i=1}^{m} X_i$ vectors $X' \le X$ at each of $n$ successive stages, and hence
$O(n \prod_{i=1}^{m} X_i)$ time altogether.

For the two-stage multiknapsack problem, we again consider the situation
in which each revenue $\boldsymbol{\omega}_j$ can assume only $k$ distinct values, for a fixed $k$. Let
$R_i = \sum_{j=1}^{n} r_{ij}$ and note that $0 \le X_i \le R_i$ $(i = 1, \ldots, m)$. At stage $j$, only the
$k^j$ different realizations of $(\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_j)$ need to be distinguished $(j = 1, \ldots, n)$.
The algorithm therefore has to consider $O(k^j \prod_{i=1}^{m} R_i)$ pairs $(X, \omega)$ at stage $j$.
Summation over all $j$ yields an $O(k^n \prod_{i=1}^{m} R_i)$ time bound for the computation
of all $Y^*(X, \omega)$ and also for the determination of a budget vector $X^*$ that is
optimal in expectation.
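
For a single budget ($m = 1$) the combined recursion over stages and realizations can be sketched as below. The code assumes independent revenues with finitely many values per project, keeps one table $F_j(\cdot, \omega)$ per prefix realization $(\omega_1, \ldots, \omega_j)$, and uses hypothetical function names; it is a sketch of the idea rather than the PASCAL implementation.

```python
def expected_revenue_by_budget(r, values, probs):
    """Return the list E Y*(X, w) for X = 0..R with R = sum(r), case m = 1.

    r[j]      : integral budget requirement of project j
    values[j] : possible revenues of project j
    probs[j]  : matching probabilities (revenues assumed independent)
    """
    R = sum(r)
    expected = [0.0] * (R + 1)

    def recurse(j, F, prob):
        # F[X] = F_j(X, w) for the current prefix realization (w_1, ..., w_j)
        if j == len(r):
            for X in range(R + 1):
                expected[X] += prob * F[X]
            return
        for w_j, p_j in zip(values[j], probs[j]):
            # either skip project j or, if the budget allows, include it
            G = [max(F[X], F[X - r[j]] + w_j) if X >= r[j] else F[X]
                 for X in range(R + 1)]
            recurse(j + 1, G, prob * p_j)

    recurse(0, [0.0] * (R + 1), 1.0)
    return expected


def best_budget(r, values, probs, c):
    """Integral budget X* maximizing E Z*(X, w) = -c*X + E Y*(X, w)."""
    expected = expected_revenue_by_budget(r, values, probs)
    X_star = max(range(len(expected)), key=lambda X: expected[X] - c * X)
    return X_star, expected[X_star] - c * X_star
```

At depth $j$ of the recursion there are $k^j$ active tables, which mirrors the stage-wise count of pairs $(X, \omega)$ given above.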
21.4.4 Computational results

The dynamic programming algorithm was coded in PASCAL and run on a CD 
Cyber 170-750 to solve several instances of the two-stage knapsack problem. 
We set $m = 1$ at the outset and did not attempt to solve proper multiknapsack
problems, for which $m \ge 2$. We assumed independence of the revenues $\boldsymbol{\omega}_j$ and
tried to make the second stage knapsack problem nontrivial by specifying a
high correlation between the expected revenue $E\boldsymbol{\omega}_j$ of project $j$ and its budget
requirement $r_{1j}$. The solution of instances with twelve projects and two possible
revenue values for each of them required about ten seconds.

For the problem instance given by

$m = 1$, $c_1 = 1$,

$n = 12$, $\Pr\{\boldsymbol{\omega}_j = a_{1j}\} = \Pr\{\boldsymbol{\omega}_j = a_{2j}\} = \frac{1}{2}$ $(j = 1, \ldots, n)$,

with the values of $r_{1j}$, $a_{1j}$, $a_{2j}$ $(j = 1, \ldots, n)$ given in Table 21.1, Figure 21.4
shows the expected total profit $EZ^*((X_1), \boldsymbol{\omega})$ as a function of the budget size
$X_1$. Note that the profit is shown only for integral $X_1$; the line segments that
start from the points shown with a slope $-c_1$, and that indicate the profit for
fractional $X_1$, have been deleted. Even if we restrict our attention to integral
values of $X_1$, the profit function has many local maxima.


Table 21.1 Knapsack: numerical data 


j 1 2 3 4 5 6 7 8 9 10 11 12
Ty § 2 9 13 10 8 47 10 6 4 9 
aij 7 4 12 17 158 12 5 9 14 9 6 Ii 
a7 8 1 OT Bo Ta ae OF ek 
PART IV


Applications and Test Problems 


CHAPTER 22 
FACILITY LOCATION PROBLEM 


Yu. Ermoliev 


22.1 Introduction 


The public provision of urban facilities and services often takes the form of a 
few central supply points serving a large number of spatially dispersed demand 
points. These facilities include hospitals, schools, libraries, and emergency pro- 
visions such as fire and police services. One of the fundamental features of 
these systems is the spatial interaction between suppliers and consumers. The 
need to introduce behavioral patterns more realistic than simply assuming that 
customers use the nearest facility has been recognized by many authors, among 
them Coelho and Wilson [4], Hodgson [7], Beaumont [1], and Leonardi [8],[9].
Since the proposed spatial interaction (“gravity”) models can be justified both 
theoretically and empirically, their use in location modeling seems promising. 

However, the classical spatial interaction models solve only part of the 
problem. Although they are based on stochastic assumptions [14], [11], [8] they 
use only the expected values of the underlying stochastic processes. A natural 
further step is therefore to introduce the stochastic behavior explicitly, thus 
allowing for uncertainty in both customer choice and demand knowledge. This 
was the approach in papers [8],[6]. The aim of this paper is to describe some of
the problems arising when such stochastic features are introduced and to dis- 
cuss the computational feasibility of stochastic quasigradient (SQG) methods. 
Practical results obtained in [6] are presented for a stochastic problem which 
deals with the optimal size of school facilities. Real data from Turin, Italy, 
have been used in the tests and the results are compared to those obtained by 
other methods. Some results reported on involve objective functions that are 
not even continuous. The content of this chapter follows papers [5],[6].
22.2 Statement of the Problem


The simplest formulation of the deterministic facility location problem is as
follows: minimize the performance function

$$\sum_{i,j} [x_{ij} \ln(x_{ij}) + c_{ij} x_{ij}] \qquad (22.1)$$

subject to the constraints

$$\sum_{j=1}^{n} x_{ij} = a_i, \quad i = \overline{1,r}, \qquad (22.2)$$

$$x_{ij} \ge 0, \quad \forall i, j, \qquad (22.3)$$

where $x_{ij}$ is an (unknown) expected flow of users from demand location $i$ to
facility location $j$ $(i = \overline{1,r},\ j = \overline{1,n})$ per unit time; $a_i$ is the total demand (in
terms of customers to be served per unit of time) at each demand location $i$;
$c_{ij}$ are the costs of travel between each pair of locations $(i, j)$.

The objective function (22.1) was first introduced into transport planning
evaluation by Bregman [3] and Neuburger [18] and extended to location analysis
by Coelho and Wilson [4]. These authors gave this function an economic
interpretation, namely the consumer surplus measure associated with the pattern of
consumer trips $\{x_{ij}\}$.

Due to the simple form of the problem (22.1)-(22.3), the closed-form optimal
solution is not hard to find:

$$x_{ij} = a_i P_{ij}, \quad x_j = \sum_{i=1}^{r} x_{ij}, \qquad (22.4)$$

where

$$P_{ij} = \frac{\exp(-c_{ij})}{\sum_{l=1}^{n} \exp(-c_{il})}$$

and $x_j$ is the size of the facility at $j$. Note that the quantities $P_{ij}$ satisfy the
following conditions:

$$\sum_{j=1}^{n} P_{ij} = 1, \quad P_{ij} > 0, \quad i = \overline{1,r},\ j = \overline{1,n}. \qquad (22.5)$$

Equations (22.4) and (22.5) imply that trips from demand locations to facilities
are made according to a very simple interaction rule. The quantity $P_{ij}$ can be
interpreted as the probability that a customer living at location $i$ will choose the
facility at location $j$. Then $x_{ij}$ is the expected number of customers traveling
between $i$ and $j$.
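
The closed-form solution (22.4) is straightforward to evaluate. The following sketch computes the choice probabilities and the expected flows; the function names are hypothetical, and subtracting the row minimum before exponentiating is a numerical safeguard added here that leaves the probabilities unchanged.

```python
import math


def choice_probabilities(cost):
    """P[i][j] = exp(-c_ij) / sum_l exp(-c_il) for an r x n cost matrix."""
    P = []
    for row in cost:
        shift = min(row)                      # avoids underflow for large costs
        weights = [math.exp(-(c_ij - shift)) for c_ij in row]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


def expected_flows(demand, cost):
    """Return (x, sizes) with x[i][j] = a_i * P_ij and sizes[j] = sum_i x[i][j]."""
    P = choice_probabilities(cost)
    x = [[a_i * p for p in row] for a_i, row in zip(demand, P)]
    sizes = [sum(x[i][j] for i in range(len(x))) for j in range(len(cost[0]))]
    return x, sizes
```

Each row of the returned probability matrix sums to one and is strictly positive, in agreement with (22.5).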

It is worth noting that the interpretation of the quantities $P_{ij}$ as probabilities
is connected with the theory of probabilistic choice behavior [12]. It
has also been shown by Bertuglia and Leonardi [$8] that these quantities can be
considered as a steady-state distribution of a suitably defined Markov process.

It is now possible to use equation (22.4) as the basis from which to make 
some generalizations concerning stochasticity. The simplest of these are as 
follows: 


(1) The demand $a_i$ at demand location $i$ is not known in advance; it is a
random variable. This assumption is reasonable in many long-term planning
applications. For instance, in a high-school location problem the total
number of students living in each demand location may change over time
and so cannot be known in advance.

(2) Customers living in district $i$ choose their destinations $j$ independently of
each other with probability $P_{ij}$.


These assumptions are embodied in the following model, which assumes
that the choices made by the customers are stochastic. Let $\varepsilon_{ij}$ be the actual
(random) number of customers traveling from $i$ to $j$ and define $\tau_j$, the total
number of customers attracted to $j$, as follows:

$$\tau_j = \sum_{i=1}^{r} \varepsilon_{ij}.$$

Note also that

$$\sum_{j=1}^{n} \varepsilon_{ij} = a_i, \quad i = \overline{1,r}. \qquad (22.6)$$

Let $H_j(y)$ denote the distribution function of $\tau_j$:

$$H_j(y) = P\{\tau_j \le y\}.$$


The distribution function $H_j(y)$ cannot easily be given in closed form, but
random draws of $\tau_j$ can be computed using a simple simulation model based
on equation (22.6). If $x_j$ is the planned size of the facility at $j$, then the actual
number $\tau_j$ of customers attracted to $j$ may not be equal to $x_j$. Suppose that
a cost

$$\alpha_j^+ (x_j - \tau_j)$$

has to be paid when $x_j \ge \tau_j$ and a cost

$$\alpha_j^- (\tau_j - x_j)$$

has to be paid when $x_j < \tau_j$. We therefore have the cost function

$$f_j(x_j, \tau_j) = \begin{cases} \alpha_j^+ (x_j - \tau_j), & \text{if } x_j \ge \tau_j, \\ \alpha_j^- (\tau_j - x_j), & \text{if } x_j < \tau_j. \end{cases}$$
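
A simulation model of the kind referred to above might look as follows. This is a hedged sketch: it treats the demands $a_i$ as known integers for the draw (a random demand would first have to be sampled from its own distribution), lets every customer choose a facility independently with probabilities $P_{ij}$, and uses hypothetical function names.

```python
import random


def draw_attraction(demand, P, rng):
    """One random draw of tau = (tau_1, ..., tau_n).

    demand[i] : integer number of customers at demand location i
    P[i][j]   : probability that a customer from i chooses facility j
    """
    n = len(P[0])
    tau = [0] * n
    for a_i, row in zip(demand, P):
        # each of the a_i customers picks a facility independently
        for j in rng.choices(range(n), weights=row, k=a_i):
            tau[j] += 1
    return tau


def facility_cost(x_j, tau_j, alpha_plus, alpha_minus):
    """Cost f_j(x_j, tau_j): penalty for surplus or for deficit of capacity."""
    if x_j >= tau_j:
        return alpha_plus * (x_j - tau_j)
    return alpha_minus * (tau_j - x_j)
```

Averaging `facility_cost` over many draws gives a Monte Carlo estimate of the expected cost of a planned size.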
The resulting stochastic programming problem is then as follows: determine
the sizes $x_j$ of the facilities $j = \overline{1,n}$ that minimize the expected cost

$$F(x_1, \ldots, x_n) = \sum_{j=1}^{n} E f_j(x_j, \tau_j) = \sum_{j=1}^{n} \Big[ \alpha_j^+ \int_0^{x_j} (x_j - y)\, dH_j(y) + \alpha_j^- \int_{x_j}^{\infty} (y - x_j)\, dH_j(y) \Big] \qquad (22.7)$$

subject to the constraints

$$x_j \ge 0, \quad j = \overline{1,n}. \qquad (22.8)$$


Note that the objective function contains no spatial interaction embedding term
since the behavior of the customer is included in the structure of the
probabilities $P_{ij}$.

Practical problems that lead to the minimization of a function such as 
equation (22.7) are common in operations research. For example, we could 
consider a facility allocation problem or a storage inventory control problem 
where some capacities have to meet random demand and both surpluses and 
deficits incur penalty costs. 

In the special case where $F(x)$ has continuous derivatives, minimization
of $F(x)$ by analytical means would lead to the consideration of the partial
derivatives

$$\frac{\partial}{\partial x_j} F(x) = \alpha_j^+ \int_0^{x_j} dH_j(y) - \alpha_j^- \int_{x_j}^{\infty} dH_j(y).$$

The solution would then require the determination of $x = (x_1, \ldots, x_n)$ such that

$$\frac{\partial}{\partial x_j} F(x) = 0, \quad j = \overline{1,n}.$$

In general it may not be possible to solve this equation analytically (for instance,
if $H_j(y)$ is unknown, as in problem (22.7)-(22.8)).
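
When $H_j$ is known and continuous, the stationarity condition can be made explicit. The short derivation below is added for illustration and assumes that $\tau_j \ge 0$, so that $\int_0^{x_j} dH_j(y) = H_j(x_j)$ and $\int_{x_j}^{\infty} dH_j(y) = 1 - H_j(x_j)$, and that $\alpha_j^+ + \alpha_j^- > 0$:

$$\alpha_j^+ H_j(x_j) - \alpha_j^- \big(1 - H_j(x_j)\big) = 0 \quad \Longleftrightarrow \quad H_j(x_j) = \frac{\alpha_j^-}{\alpha_j^+ + \alpha_j^-}.$$

Under these assumptions the optimal size of facility $j$ is a quantile of the distribution of $\tau_j$: the higher the deficit penalty $\alpha_j^-$ relative to the surplus penalty $\alpha_j^+$, the larger the planned capacity.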
22.3 The Stochastic Quasigradient Method


The problem (22.7)-(22.8) contains two typical difficulties of stochastic programs
(see Chapter 1). First, it is difficult or impossible to compute the exact
values of the integrals appearing in (22.7), except for special and well-behaved
forms of the distribution functions $H_j(y)$. Actually the functions $H_j$ are defined
only by means of a rule for generating random draws by Monte-Carlo-type
simulation procedures. Thus, to solve such problems it is necessary to use procedures
which do not calculate the exact values of the objective functions. Second, the
objective function (22.7) is generally nonsmooth.

This becomes clear after reformulating problem (22.7)-(22.8) as a 
stochastic minimax problem. It is easy to see that 

$$f_j(x_j, r_j) = \max\{\alpha_j^+ (x_j - r_j),\ \alpha_j^- (r_j - x_j)\}.$$

The objective function (22.7) is therefore 

$$F(x) = \sum_{j=1}^{n} E \max\{\alpha_j^+ (x_j - r_j),\ \alpha_j^- (r_j - x_j)\}. \tag{22.9}$$

Function (22.9) is convex, but in general nonsmooth, since the maximization 
operator is present under the mathematical expectation sign. The stochastic 
quasigradient method for this particular stochastic minimax problem 
works as follows. 

Let $x^0 = (x_1^0, \ldots, x_n^0)$ be an arbitrary initial approximation and 
$x^s = (x_1^s, \ldots, x_n^s)$ be the approximation computed after the $s$-th iteration. A random 
observation $r^s = (r_1^s, \ldots, r_n^s)$ of the vector $r = (r_1, \ldots, r_n)$ is obtained by 
simulation. A new approximation is determined by the rule: 

$$x_j^{s+1} = \max\{0,\ x_j^s - \rho_s \xi_j^s\}, \quad j = 1, \ldots, n, \quad s = 0, 1, \ldots \tag{22.10}$$

where $\xi_j^s$ is the $j$-th component of a stochastic quasigradient, here a subgradient 
of $f_j(\cdot, r_j^s)$ at $x_j^s$ ($\xi_j^s = \alpha_j^+$ if $x_j^s \ge r_j^s$ and $\xi_j^s = -\alpha_j^-$ otherwise), 
and $\rho_s$ is a step multiplier, such that 

$$\rho_s \ge 0, \quad \sum_{s} \rho_s = \infty, \quad \sum_{s} \rho_s^2 < \infty. \tag{22.11}$$

In principle, the convergence of $\{x^s\}$ will be obtained if the step multipliers 
$\rho_s$ are chosen so as to satisfy the step conditions (22.11). For the practical 
construction of the step-size control, these requirements are only of general 
importance. 

Figure 22.1 The behavior of the sequences $\{F_k\}$ and $\{E_k\}$ as a function of 
the iteration number. 
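
A minimal Python sketch of iteration (22.10) may help fix ideas. It assumes the subgradient choice $\xi_j^s = \alpha_j^+$ when $x_j^s \ge r_j^s$ and $-\alpha_j^-$ otherwise, and the step rule $\rho_s = \rho^0/(s+1)$, which is only one of many sequences satisfying (22.11). The function names and the uniform demand generator are hypothetical.

```python
import random

def sqg_minimize(sample_demand, x0, alpha_plus, alpha_minus, iterations, rho0=1.0):
    """Run the projected stochastic quasigradient iteration (22.10)."""
    x = [float(v) for v in x0]
    for s in range(iterations):
        rho = rho0 / (s + 1)      # assumed step rule: sum rho = inf, sum rho^2 < inf
        r = sample_demand()       # one simulated observation r^s of the demand vector
        for j in range(len(x)):
            # Subgradient of max{a+ (x - r), a- (r - x)} with respect to x_j.
            xi = alpha_plus[j] if x[j] >= r[j] else -alpha_minus[j]
            x[j] = max(0.0, x[j] - rho * xi)   # projection onto x_j >= 0
    return x

# Illustrative demand: two facilities with uniform demand on [0, 10] and [5, 15].
rng = random.Random(1)

def uniform_demand():
    return [rng.uniform(0.0, 10.0), rng.uniform(5.0, 15.0)]

x_final = sqg_minimize(uniform_demand, [0.0, 0.0], [1.0, 1.0], [1.0, 1.0],
                       iterations=20000, rho0=5.0)
```

With equal penalties the minimizers of the expected cost are the medians of the two demand distributions (5 and 10 here), so the iterates should settle near those values.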


22.4 Practical Computations 


The methods of controlling the step size in stochastic minimization are usually 
based on keeping the step multiplier constant during a number of iterations and 
then reducing it according to certain rules. In the course of the iterations a 
succession of function values $F_k = \sum_j f_j(x_j^k, r_j^k)$ is observed. Usually these 
values vary over a wide range. However, the sequence 

$$E_k = \frac{1}{k} \sum_{s=0}^{k} \sum_{j=1}^{n} f_j(x_j^s, r_j^s) \tag{22.12}$$

shows a smoother behavior, as can be seen in Figure 22.1. Indeed, $E_k$ could be 
expected to approach a stationary value. One rule for controlling the step size 
relies on this fact. The method can be summarized as follows: 


(1) Choose the initial value $\rho^0$ for the step multiplier. 

(2) Using $\rho^0$ for the step multiplier, calculate the value of $E_k$ according to 
equation (22.12). 

(3) When a stationary sequence $\{E_k\}$ is observed, reduce the step multiplier 
by one half. 

(4) Go back to step (2) with the new value of the step multiplier until no 
improvement in the test function $E_k$ is observed. 

There are some unanswered questions in the procedure outlined above. 
First, how should the initial step multiplier be chosen? If it is too large, both 
the sequence $\{E_k\}$ and the iterates $x^s$ will oscillate heavily and no decrease 
in the objective function will be observed. If the initial step multiplier is too 
small, the rate of decrease will be very small and perhaps hardly noticeable. 
From the computational point of view the latter situation is more harmful and 
should be avoided, while the situation arising from too large a step multiplier 
is rapidly recognized and hence can be corrected. As a rule of thumb the initial 
step should be chosen to satisfy 


pl; pe; (22.13) 


where $\gamma \in (0,1)$ and $x_j$ is the estimated value for the $j$-th component of the 
solution. 

The use of the step multiplier $\rho_s$ also needs further explanation. The ideal way of 
controlling the procedure would be on-line, where the program continuously plots 
the values of the sequence $\{E_k\}$ on the screen and where the iterations could 
be manually interrupted to cut down the step multiplier. This is not always 
possible, and the iterations must then be performed in small batches, after which the 
values of $E_k$ are plotted and possible adjustments of the step multiplier can 
take place. A definite way to find the stationary phase of the sequence is to 
rescale the coordinate axes before plotting the values of a new batch. In this 
case the stationary phase is in fact recognized as smooth oscillations around a 
fixed value. 

Figure 22.2 shows an example of the behavior of $E_k$ as a function of the 
iteration number $k$. The values for the coefficients are $\alpha_j^+ = \alpha_j^- = 1.00$, 
$j = 1, \ldots, 23$, $\rho^0 = 1.00$, and the components of the initial estimate and the solution 
are known to differ by at most five units. Note that the rate of decrease of the 
sequence $\{E_k\}$ is fast during the first iteration batches but becomes slower as 
the step size decreases. Hence a crude estimate of the result is obtained after 
a rather small number of iterations, but for greater accuracies the number of 
iterations needed grows rapidly. 

If rigorously followed, the basic procedure for the step-size control may 
lead to a slow performance of the algorithm. First, the manual step-size control 
with many I/O operations requires considerable effort from the person who 
does the calculations and usually this affects the response time. This happens 
especially in a time-sharing computer environment where the number of users is 
large and the average response time is already quite long. Second, the number 
of iterations needed can often be significantly reduced. 

Figure 22.2 The convergence behavior of $\{E_k\}$ in the manual control and 
simulated manual control cases. 

To overcome the need for numerous manual I/O operations, a simple 
automatic version of the manual step-size control can be designed. Given three 
parameters, the procedure simulates the behavior of the controlling person and 
reduces the step multiplier as soon as it observes a stationary or an oscillatory 
sequence $\{E_k\}$. Let the three input parameters be NB, DIF1, and DIF2. The 
first parameter NB fixes the batch size, i.e., the iterations will be performed 
in batches of NB iterations. Let the step multiplier used during the iteration 
batch be equal to $\rho$. A test indicator is defined as: 

$$d_m = \frac{E_{(m-1) \cdot NB} - E_{m \cdot NB}}{\rho \cdot NB}, \quad m = 1, 2, \ldots \tag{22.14}$$

The procedure then checks the two conditions 

$$d_m < \mathrm{DIF1} \tag{22.15a}$$

and 

$$\frac{\sum_{s \in M} \Delta^+ E_s}{\max_{s \in M} E_s - \min_{s \in M} E_s} \ge \mathrm{DIF2} \tag{22.15b}$$

where 

$$\Delta^+ E_s = \max(0,\ E_s - E_{s-1}) \tag{22.16a}$$

$$M = \{ s \mid (m-1) \cdot NB < s \le m \cdot NB \} \tag{22.16b}$$

In the case when either of these conditions holds the step multiplier is reduced 
by one half. The first condition (22.15a) tests if the decrease of the sequence 
in proportion to the step size used is less than the given limit. The second 
condition (22.15b) then checks if the sequence is oscillatory. This is done by 
considering the ratio of the sum of the positive jumps of the sequence $\{E_k\}$ 
to the maximum change in the sequence that takes place during the iteration 
batch. 
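
The batch test can be sketched in Python as follows. The sketch assumes that the denominator of (22.14) is $\rho \cdot NB$ and that condition (22.15b) uses a non-strict inequality; the function name `batch_tests` and the two example sequences are hypothetical.

```python
def batch_tests(E, m, NB, rho, DIF1=0.01, DIF2=0.30):
    """Evaluate the indicators for batch m = 1, 2, ... of the sequence E.

    E[k] holds E_k for k = 0, ..., m * NB. Returns (d_m, oscillation ratio, reduce?).
    """
    start, end = (m - 1) * NB, m * NB
    d_m = (E[start] - E[end]) / (rho * NB)            # assumed form of (22.14)
    M = range(start + 1, end + 1)                     # index set (22.16b)
    jumps = sum(max(0.0, E[s] - E[s - 1]) for s in M)  # positive jumps (22.16a)
    spread = max(E[s] for s in M) - min(E[s] for s in M)
    ratio = jumps / spread if spread > 0 else 0.0
    return d_m, ratio, (d_m < DIF1 or ratio >= DIF2)

# A steadily decreasing batch and a zigzag batch (made-up values).
steady = [100.0 - 2.0 * k for k in range(11)]
zigzag = [100.0, 90.0, 95.0, 85.0, 90.0, 80.0]
steady_result = batch_tests(steady, m=1, NB=10, rho=1.0)
zigzag_result = batch_tests(zigzag, m=1, NB=5, rho=1.0)
```

The steady batch passes both tests, so the step multiplier would be kept. The zigzag batch still decreases on average but fails the oscillation test, so the multiplier would be halved.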

With DIF1 = 0.01 and DIF2 = 0.30 the procedure simulates the manual 
control very closely (Figure 22.2). Depending on the starting values used for 
$x^0$ and $\rho^0$, sometimes a few more iterations were performed than the manual 
control would have required, but the total computing time still usually remained 
smaller than in the case of manual control. 

With the aforementioned values for DIF1 and DIF2 the automatic step-size 
control normally guarantees that the solution is eventually reached, 
independent of the initial values for $x^0$ and $\rho^0$. Often the algorithm can be made faster 
by using a greater value for DIF1. If, for example, DIF1 = 1.00, the 
control would reduce the step multiplier as soon as the total decrease of 
the objective function during a batch is less than the total change of the 
components in that batch. If the solution can initially only be roughly estimated, 
the number of iterations can be kept moderate. This can be done by 
choosing an initial value for $\rho$ that will reach the solution region in a few 
iterations and by cutting down the step size as soon as the rate of decrease of the 
objective function slows down. Using the test indicator $d_m$ of equation (22.14) 
the program checks if 

$$d_m < \mathrm{DIF1} \tag{22.17a}$$

or 

$$d_m \le d_{m-1} \tag{22.17b}$$

Instead of $E_k$, an average of a few neighboring values of $E_k$ can be used to 
calculate the indicator $d_m$. If either of conditions (22.17) holds, the step multiplier 
is cut down by a factor $r$, which is given as an input. 
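
A sketch of the accelerated rule (22.17) in Python follows. The names `accelerated_step` and `reduction` are illustrative, and the caller is assumed to supply the indicator $d_m$ of (22.14) after each batch.

```python
def accelerated_step(rho, d_history, d_m, DIF1=1.0, reduction=0.5):
    """Return the step multiplier to use after a batch with indicator d_m.

    The multiplier is cut by `reduction` if d_m < DIF1 (22.17a) or if
    d_m <= d_{m-1} (22.17b); d_history collects the earlier indicators.
    """
    too_slow = d_m < DIF1
    slowing_down = bool(d_history) and d_m <= d_history[-1]
    d_history.append(d_m)
    return rho * reduction if (too_slow or slowing_down) else rho

# Made-up indicator values for four consecutive batches.
history = []
rho = 1.0
trace = []
for d in [3.0, 4.0, 2.5, 0.5]:
    rho = accelerated_step(rho, history, d)
    trace.append(rho)
```

In this made-up run the multiplier is kept while the indicator grows, halved once the decrease slows (third batch), and halved again when the indicator drops below DIF1.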

The effect of the accelerated procedure is seen in Figure 22.3, where the 
curves correspond to the accelerated step-size control. The reduction coefficient 
$r$ is 0.5 in both cases, but in the first case the batch size is 10 and in the second 
it is 5. DIF1 has now been set to 1.0. It is seen that some decrease in the number 
of iterations has been obtained in both cases as compared to the situation in 
Figure 22.2, but the difference is quite small. However, in this example a good 
estimate of the solution is known in advance and the number of iterations is 
rather small with any kind of step-size control. Note that if the initial estimate 
for $x$ is far from the actual solution and a small initial value is used for $\rho$, then 
the accelerated procedure may reduce the step too rapidly, and an excessive 
number of iterations is needed to obtain the solution. As noted earlier, this 
danger can normally be eliminated by selecting an initial estimate $\rho^0$ that is 
too big rather than too small. 

Figure 22.3 The convergence behavior of $\{E_k\}$ in the accelerated step-size 
control case. 


22.5 A Case Study 


An example of a resource allocation problem (see [6]) that minimizes costs 
to meet uncertain demand will be discussed in this section. The problem is 
a high school location problem in Turin, Italy. The physical setting and the 
data for this problem are described in Leonardi and Bertuglia [10]. For the 
purposes of this analysis, Turin is divided into 23 districts, each district being 
both a demand source and a possible high school facility location. Customers 
are assumed to behave according to a gravity-type model. For simplicity, travel 
time is assumed as the only explanatory variable for the choice behavior (some 
theoretical underpinnings for such models are described in Leonardi [9]). 

However, unlike in the standard models, the gravity model will be given 
here a stochastic interpretation, as suggested in Section 22.2. That is, the 
relative distribution of students among facilities is a discrete multinomial Bernoulli 
distribution, rather than a set of deterministic fractions. In mathematical 
terms this takes the following form. 

Let $a_i$, $i = 1, \ldots, n$, be the total number of students at point $i$. The problem 
is to determine the sizes $x_j$ of the facilities at points $j$, $j = 1, \ldots, n$, when it is 
known that the students at point $i$ choose the facility at point $j$ with probability 

$$p_{ij} = \frac{e^{-\lambda c_{ij}}}{\sum_{j=1}^{n} e^{-\lambda c_{ij}}} \tag{22.18}$$

where $\lambda$ is a constant and $c_{ij}$ are empirical coefficients that depend on the 
distance between $i$ and $j$ (in this example: travel times in minutes). The use of 
(22.18) for the probabilities has theoretical and empirical justifications. Model 
(22.18) is a simplified form of the logit model discussed in McFadden [11], [12], 
for example. If the flow of students between $i$ and $j$ is denoted $\varepsilon_{ij}$, the stochastic 
demand at point $j$ is then 

$$r_j = \sum_{i=1}^{n} \varepsilon_{ij} \tag{22.19}$$

while the number of students at point $i$ can be written as 

$$a_i = \sum_{j=1}^{n} \varepsilon_{ij} \tag{22.20}$$

The numbers $a_i$ are now deterministic and given as input. If the unit cost of 
capacity surplus is $\alpha$ and that of deficit is $\beta$, and no other costs are considered, 
then our cost minimization problem is of the type (22.7) with $\alpha_j^+ = \alpha$, $\alpha_j^- = \beta$, 
$j = 1, \ldots, n$. 

The ability to generate random realizations $r^s$ of the demand vector is 
essential for applying the quasigradient method. The direct determination of 
the distribution of $r_j$ is practically quite difficult in this case. Instead, 
random vectors can be generated by simulating individual choices of the students 
according to the probabilities $p_{ij}$ in (22.18). 
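
The simulation of individual choices can be sketched in Python as follows. The sketch assumes the logit form $p_{ij} \propto e^{-\lambda c_{ij}}$ for (22.18); the three-district travel times and student counts are made up for illustration and are not the Turin data.

```python
import math
import random

def choice_probabilities(c, lam):
    """p[i][j] = exp(-lam * c[i][j]) / sum_k exp(-lam * c[i][k])."""
    p = []
    for row in c:
        weights = [math.exp(-lam * cij) for cij in row]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def sample_demand(a, p, rng):
    """Draw one demand vector r by simulating the choice of every student."""
    n = len(p[0])
    r = [0] * n
    for i, students in enumerate(a):
        # Each student at point i independently picks facility j with probability p[i][j].
        for j in rng.choices(range(n), weights=p[i], k=students):
            r[j] += 1
    return r

# Hypothetical travel times (minutes) and student counts for three districts.
c = [[5.0, 15.0, 25.0],
     [15.0, 5.0, 15.0],
     [25.0, 15.0, 5.0]]
a = [30, 50, 20]
p = choice_probabilities(c, lam=0.15)
demand = sample_demand(a, p, random.Random(0))
```

Each call to `sample_demand` yields one realization of the demand vector whose components sum to the total number of students, as required by (22.19) and (22.20).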

Table 22.1 shows the solutions obtained for $\alpha = \beta = 1.0$. In this case 
the results can be compared with the solution $x_j = \sum_i a_i p_{ij}$ of a deterministic 
problem that is based on an entropy approach. The first column in Table 22.1 
contains the labels of the districts, numbered from 1 to 23. The second column gives 
the vector $a = (a_1, \ldots, a_{23})$ of total demands in each district; $a$ was also used as the 
initial estimate for the iteration. Here the original data from Turin have been 
multiplied by 1/100. The next three columns show the results originating from 
the use of different starting values for the iteration. The last column shows 
the solution based on the deterministic model. In general, a good agreement 
exists between all the solutions; they are usually within two digits of each 
other. There are, however, some significant discrepancies. These can be partly 
explained by the stochastic nature of the convergence and by the flatness of the 
objective function near the solution. They are also related to the slow 
convergence of the algorithm as the number of iterations increases. 

The discrepancies between the solutions in Table 22.1 can be associated 
with the shape of the probability densities underlying the probabilities of (22.18). 
The values that are used for the coefficients $c_{ij}$ are listed in Table 22.6; the 
value of the constant $\lambda$ is 0.15. Probability densities can be numerically 
approximated from these data. Densities for several of the components are drawn 
in Figure 22.4. The densities are mostly symmetric and strongly peaked. In 
these cases the stochastic minimization solution, which corresponds to the 
median of this distribution, and the deterministic solution, which corresponds to 
the expected value, should be close to each other. This is in fact demonstrated, 
for instance, by the facility sizes in districts 8 and 9, where the discrepancies 
are small. However, for district 1 the density is flat and skew, and the median 
and expected values are not equal. In the solutions for $x_1$, on the other hand, 
the discrepancies are large. The flatness of the densities also explains the 
large discrepancies between the different solutions obtained from the stochastic 
minimization procedure. 

Table 22.1 Optimal location of Turin high schools. Solutions obtained for 
penalty costs $\alpha = \beta = 1.0$. Columns: District; Number of students; three 
stochastic solutions from different starting values; Deterministic solution. 


Figure 22.4 The probability densities for random demand $r_j$ at location 
$j = 1$, 8, or 9. Axes: random variable $\omega_j$; probability density $p_j(\omega_j)$. 


In Table 22.2 solutions are presented for cases where $\alpha$ and $\beta$ differ from 
each other. As one could expect, the increase in the relative cost of deficit 
compared to the cost of surplus leads to larger values in the solution vector. If, 
however, the probability density of the corresponding component $r_j$ is very 
peaked, as in the case of $r_{21}$, the change in the relative costs does not have any 
significant influence on the solution. 

Table 22.2 Optimal location of Turin high schools. Solutions obtained for 
different values of penalty costs $\alpha$ and $\beta$. 




22.6 A Nonconvex Objective Function 


The problem discussed so far lacks some of the main features that are usually 
considered typical for optimal location problems. For instance, economies of 
scale, which make location problems nontrivial, are absent in our earlier for- 
mulation. In deterministic models, economies of scale are usually introduced 
by means of fixed charges, to be paid when a facility is established, no matter 
what the number of attracted customers. This formulation is typical of the 
well-known plant-location problems of operations research. Related ways to 
introduce scale effects are by means of suitable constraints, as on the total number 
of facilities or on the minimum feasible size for facilities. 

Here the first formulation will be explored. Let a fixed cost $\gamma$ be defined, 
to be paid when a facility is established. For simplicity, let us assume that the 
same value of $\gamma$ applies to all districts. Then the minimization of the expected 
cost calls for finding the minimum of the function 

$$G(x) = \gamma \sum_{j=1}^{n} \delta(x_j) + E\left\{ \sum_{j=1}^{n} \max[\alpha(x_j - r_j),\ \beta(r_j - x_j)] \right\} \tag{22.21}$$

where $\delta(x)$ is the unit step function at zero. It is easy to see that with 
nonnegative $x_j$, $G(x)$ is not convex and usually has several local minima. 
Problems of this form are normally treated with mixed integer programming 
methods. Here we attempt to apply the general idea of stochastic 
quasigradients to finding the global minimum. Approximating the step function by a 
logarithmic function, the estimate 





ga 


. 6 & 
= (2 if 2 <r} 
J 6 

zi te 


“8 ite>r (22.22) 
with $\varepsilon$ a small positive constant, is used in computing the generalized gradient at 
$x = x^s$. Otherwise the procedure follows the gradient calculated by equation 
(22.10). 
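
The nonconvexity caused by the fixed charge can be checked numerically. The Python sketch below estimates $G(x)$ by a sample average, reading $\delta(x_j)$ as 1 when $x_j > 0$ and 0 otherwise; the function name and the single-facility data are hypothetical.

```python
def fixed_charge_cost(x, demand_samples, alpha, beta, gamma):
    """Sample-average estimate of the fixed-charge objective G(x)."""
    # Fixed charge: gamma is paid once for every facility that is open.
    fixed = gamma * sum(1 for xj in x if xj > 0)
    penalty = 0.0
    for r in demand_samples:
        penalty += sum(max(alpha * (xj - rj), beta * (rj - xj))
                       for xj, rj in zip(x, r))
    return fixed + penalty / len(demand_samples)

# One facility facing a constant demand of 2 units (made-up numbers).
samples = [[2.0]]
closed = fixed_charge_cost([0.0], samples, alpha=0.5, beta=0.5, gamma=5.0)
half = fixed_charge_cost([1.0], samples, alpha=0.5, beta=0.5, gamma=5.0)
matched = fixed_charge_cost([2.0], samples, alpha=0.5, beta=0.5, gamma=5.0)
```

Here the cost at the midpoint exceeds the average of the costs at the two end points, which a convex function would not allow, and closing the facility is cheaper than matching the demand.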

In general, the procedure rapidly finds a minimum which is at least local. 
After that, however, some difficulties arise with the control of the iteration 
process. In principle, the approximation 


$$G_k^1(x^k) = \gamma \sum_{j=1}^{n} \delta(x_j^k) + \frac{1}{k} \sum_{s=0}^{k} \sum_{j=1}^{n} \max[\alpha(x_j^s - r_j^s),\ \beta(r_j^s - x_j^s)] \tag{22.23}$$


can be used again to follow the progress of the iterative scheme. Now, however, 
after a number of iterations the function $G_k^1(x^k)$ may achieve a minimum. On 
the other hand, some components of the estimate of the generalized gradient 
as calculated from equation (22.22) may still show a trend toward the origin, 
where another (at least) local minimum would be found. Note that with a small 
$\varepsilon$ the origin becomes a fixed point for the iteration: if $x_j^{s_0} = 0$ for some $s_0$, then 
$x_j^s = 0$ for all $s > s_0$. To overcome these difficulties, the initial value $x^0$ should 
be large enough and the initial step multiplier $\rho$ should be chosen such that the 
step size is a small fraction of $x_j$. In this way a fallacious convergence towards 
zero during the first iterations can be avoided. To assess the behavior of the 
function $G(x)$ at the various minima, a test function 





n ak 1 kon 
Gh(2*) =71 >> ae E >= > maxfa(2? — 12), (1? — 28)] (22.24) 
jy=0"F 6=0j=1 
could be used. In this case $m$ is a small integer, the choice of which 
depends slightly on the relative magnitude of $\alpha$, $\beta$, and $\gamma$. 

Figure 22.5 shows the behavior of the functions $G_k^1(x^k)$ and $G_k^2(x^k)$ with 
increasing $k$ for $\alpha = \beta = 0.5$, $\gamma = 5.0$, $m = 6$. It is seen that $G_k^2(x^k)$ is 
monotonically decreasing toward the global minimum while $G_k^1(x^k)$ has two local 
maxima. Table 22.3 shows the vector $x^k$ at $k = 180$, which corresponds to one 
local minimum of $G_k^1(x^k)$, and at the end of the iteration ($k = 280$). It cannot 
be proved that the solution obtained is the exact solution of the optimization 
problem. On the other hand, the computational effort that is needed for an 
estimation by the stochastic quasigradient method is also relatively small when 
compared to some integer programming methods, for instance. 


Figure 22.5 The behavior of $G_k^1(x^k)$ and $G_k^2(x^k)$ as a function of $k$. Axes: 
number of iterations $k$ (0 to 260); objective function values $G_k^1(x^k)$, $G_k^2(x^k)$. 


The solutions obtained depend mostly on the relative magnitudes of $\alpha$, $\beta$, and $\gamma$. 
With increasing fixed costs $\gamma$, more facilities are likely to remain closed. When 
$\beta$ is increased, deficits are penalized more heavily and thus more facilities 
remain open. Table 22.4 shows results from a sensitivity analysis on the values 
of $\alpha$ and $\beta$. The aim of the analysis is to find which values of $\alpha$ and $\beta$ will 
cause the smallest facility (district 21) to disappear from the solution. This 
will happen almost certainly when $\beta$ is less than 1.5. However, for a large 
range of values for $x_{21}$, between zero and five, the objective function remains 
almost constant. Hence, with these parameter values, opening or closing that 
facility does not have great influence on the value of the objective function. 

Table 22.5 shows the results of a sensitivity analysis on the fixed charge $\gamma$. The aim 
of this analysis is to find the least value of $\gamma$ leading to a solution with a single 
facility open. 

Table 22.3 Optimal location of Turin high schools. Solutions obtained after 
180 and 280 iterations with penalty costs $\alpha = \beta = 0.5$ and fixed charge $\gamma = 5.0$. 
Columns: $k = 180$; $k = 280$. 

Table 22.4 Optimal location of Turin high schools. Results of a sensitivity 
analysis for changing values of penalty costs $\alpha$ and $\beta$. The fixed charge is held 
at $\gamma = 5.0$. 

District   $\beta = 1.0$   $\beta = 1.5$   $\beta = 1.75$   $\beta = 2.0$
1               9.2            11.2            11.4             12.9
2               9.8            11.0            11.4             11.4
3              16.0            16.5            17.1             17.7
4              16.4            17.1            17.5             17.3
5              14.0            15.0            15.1             15.2
6              11.1            12.0            12.0             12.1
7               8.1             9.3            10.0              9.7
8               8.3             9.0             9.0              9.0
9              10.0            11.9            11.9             12.0
10             17.0            18.3            18.5             18.7
11             24.7            24.9            24.9             24.9
12             18.0            18.1            18.5             19.0
13             13.8            14.3            14.4             14.5
14             13.8            14.0            14.0             14.0
15             12.4            13.0            13.0             13.0
16             11.6            12.0            12.3             12.4
17             11.7            12.0            12.0             12.0
18             13.9            14.1            14.6             14.9
19              8.4             8.8             9.0              9.0
20              7.0             8.7             9.0              9.2
21               -               -               -               5.0
22              6.4             8.0             8.1              9.0
23             14.7            15.1            15.7             15.8

A few comments are appropriate here on the comparison between the 
deterministic solutions, as determined in Leonardi and Bertuglia [10], and the 
solutions obtained with the stochastic quasigradient method. Some general 
tendencies are shared by all solutions, such as the low ranking 
of district 21 and the high ranking of district 11. The general clusters of open 
locations also show some similarity. A cluster of central districts (between 1 and 
6), one of first-ring districts (between 9 and 18), and a few peripheral districts 
(usually district 23 only) appear in deterministic solutions as well. However, 
when one looks at the detailed composition of these clusters, no two of them 
are the same. Sometimes very striking differences are found, such as the closing 
or opening of district 1 (the downtown district), which would be difficult to 
justify to a public authority. The main cause for such a lack of robustness of 
stochastic methods is the existence of many local minima and many 
near-optimal solutions, with values of the objective function lying within a very narrow 
range. Of course a deterministic algorithm of an enumerative nature can still 
detect small differences, even though it may take a long time. In a stochastic 
formulation, random fluctuations might well be of the same order of magnitude 
as the range of the objective function values. This seems to be the case in our 
examples. 

Table 22.5 Optimal location of Turin high schools. Results of a sensitivity 
analysis for changing values of fixed charge $\gamma$. The penalty costs are fixed and 
equal to $\alpha = \beta = 1.5$. 

District   $\gamma = 10.0$   $\gamma = 15.0$
1                -
2                -
3              13.9
4              14.7
5              12.5
6               9.9
7                -
8                -
9               8.7
10             16.2
11             24.5
12             18.6
13             14.3
14             13.9
15             11.8
16             11.8
17             11.5
18             13.1
19               -
20               -
21               -
22               -
23             14.1

22.7 Concluding Remarks 


The purpose of this study has been to consider the stochastic quasigradient 
method for solving a resource allocation problem. The main advantages of the 
method are undoubtedly its computational simplicity and the small amount of 
information required—explicit probability distributions are not needed, random 
observations from a Monte Carlo simulation process will do. 

The computational procedure for the basic recursion equation can be 
written down by using only a few program statements and the storage requirements 
of the method are minimal. The generation of the random observations, 
however, may be time-consuming, hence the need for an algorithm made as 
efficient as possible. The standard step-size control is based on the interactive 
use of the computer and this normally guarantees that the solution is found after 
a moderate number of iterations. In this chapter some methods are presented 
that do not necessarily require continuous control from the person running the 
program and that often reduce the computation time. 

Tests were also made for a case where the objective function is nonconvex. 
In the deterministic formulation, problems of this type lead to integer pro- 
gramming methods that are often slow, unless some special assumptions (like 
linearity) concerning the objective function and constraints are satisfied. Here 
the solution is based on the same iteration algorithm as in the convex case. The 
existence of several local minima may cause some difficulties with the control of 
the iteration process, but experience shows that, given its simplicity 
and speed, the method can be efficiently applied to obtain good estimates for 
the solutions of these difficult problems. 

The practical results of determining the size of school facilities in Turin 
were generally seen to be in agreement with the solutions derived by other 
means although differences in details were found. It is true that, given the 
special probability structure of equation (22.18), some deterministic algorithms 
could be used. However, these algorithms do not apply to more general cases, 
where the stochastic procedure might be advantageous. 

CHAPTER 23 


LAKE EUTROPHICATION MANAGEMENT: THE LAKE 
BALATON PROJECT 


A.J. King, R.T. Rockafellar, 
L. Somlyódy, R.J-B Wets 


Abstract 


This is a brief overview of a collaborative effort of the Environment and 
Natural Resources, and the Adaptation and Optimization task forces at IIASA, to 
design stochastic optimization models for the management of lake 
eutrophication, and of their use in a major case study (Lake Balaton). For further details, 
consult: Somlyódy [5], [6]; Somlyódy and van Straten [8]; Somlyódy and Wets 
[9]; Rockafellar and Wets [2]; and King [1]. 


Lake Balaton (Figure 23.1), one of the largest shallow lakes of the world, 
which is also the center of the most important recreational area in Hungary, 
has recently exhibited the unfavorable signs of artificial eutrophication. An 
impression of the major features of the lake-region system (including phosphorus 
sources and control alternatives) can be gained from Figure 23.1 (for details, see 
Somlyódy et al. [7]; and Somlyódy and van Straten [8]). Four basins of different 
water quality can be distinguished in the lake (Figure 23.1), determined by the 
increasing volumetric nutrient load from east to west (the biologically available 
phosphorus load, BAP, is about ten times higher in Basin I than in Basin IV). 
The latter is associated with the asymmetric geometry of the system, namely the 
smallest western basin drains half of the total watershed, while only 5% of the 
catchment area belongs to the larger basin. 

Based on observations for the period 1971-1982, the average deterioration 
of water quality of the entire lake is about 10% (in terms of Chlorophyll-a 
(Chl-a)). According to the OECD classification, the western part of the lake is in 
a (most advanced) hypertrophic state (which is the result of the large nutrient 
load), while the eastern portion of it is in a eutrophic state. 

The modeling approach to eutrophication and its management involved four 
major phases (Somlyódy [5]). 


1. The description of the dynamics of the lake eutrophication process by a 
   simulation model (LEM) which has two sets of inputs: controllable inputs 
   (mainly artificial nutrient loads) and noncontrollable inputs (meteorological 
   factors, such as temperature, solar radiation, wind, precipitation). The 
   output of the model is the concentration vector $y$ of a number of water 
   quality components as a function of time $t$ (on a daily basis) and space 
   $r$: $y(t, r)$. LEM is calibrated and validated by relying on historical data. 

2. Derivation of stochastic inputs and the usage of LEM in a Monte Carlo 
   fashion under systematically changed load conditions, resulting in water 
   quality as a stochastic variable: $\tilde{y}(t, r)$. Selection of the indicator for water 
   quality management: for Lake Balaton the annual peak value of (Chl-a) 
   was found to be appropriate. The use of (Chl-a)max as the indicator allows 
   time to be eliminated from the analysis at the level of management. 

3. Derivation of the aggregated, stochastic load response model (LEMP) 
   giving the indicator as a function of the load (for Lake Balaton a linear 
   relationship was obtained). Design of a planning-type nutrient load model 
   (NLMP) and the incorporation of LEMP and NLMP in a management 
   optimization model (EMOM). 

4. Validation. In the course of this procedure various simplifications and 
   aggregations are made without a quantitative knowledge of the associated 
   errors. Accordingly, the last step in the analysis is validation. That is, 
   the LEM should be run with the "optimal" load scenario (found in the 
   previous step), and the "accurate" and "approximate" solutions generated 
   by the aggregated and nonaggregated versions of LEM can be compared. 


Figure 23.1 Major nutrient sources and control options. 


The lake's total P load is on average 315 t/yr (the BAP load is 170 t/yr), but 
depending on the hydrologic regime it can reach 550 t/yr. Of the load $L$, 53% is 
carried by tributaries (30% of which is of sewage origin, an indirect load; see, e.g., 
the largest city of the region, Zalaegerszeg, in Figure 23.1) and 17% is associated 
with direct sewage discharges (the recipient is the lake). Atmospheric pollution 
is responsible for 8% of the lake's load and the rest comes from direct runoff 
(urban and agricultural). Tributary load increases from east to west, while the 
change in the direct sewage load goes in the opposite direction. The sewage 
contribution (direct and indirect loads) is 30% of the total P load, while it is about 
52% of the total biologically available load (the load of agricultural origin can be 
estimated as 47% and 33%, respectively), suggesting the importance of sewage load 
from the viewpoint of short-term eutrophication control. Figure 23.1 also indicates 
the loads of sewage discharges and tributaries which were involved in the 
management optimization model. These cover about 85% of the nutrient load 
which we consider controllable in the short term (e.g., atmospheric pollution 
and direct runoff are excluded). 

Control alternatives are sewage treatment (upgrading of the biological stage 
and introduction of P precipitation) and the establishment of prereservoirs as 
indicated in Figure 23.1 (see e.g. the Kis-Balaton reservoir system planned for 
a surface area of about 75 km²). 

The nutrient load model for Lake Balaton incorporates control variables 
associated with control options mentioned. Sewage load was considered deter- 
ministic, while tributary load was modeled by the simple relationship. 


L= (Zo +a1Q+L,)(€ +6) 
where $L_0$ is the base load (mainly of sewage origin), Q is the stream flow rate, 
L, is the residual, and the variable € accounts for the influence of infrequent 
sampling (€~ is the lower bound). The most detailed data set including 25 
years of continuous records for Q and 5 years of daily observations for the loads 
was available for the Zala River (Figure 23.1) draining half of the watershed 
and representing practically the total load of Basin 1. For the Zala River L, was 
found to have a normal distribution, while Q was approached by a lognormal 
distribution. Tributary load can be controlled by choosing the size of reservoirs 
(they generally consist of two parts, having separated impacts on dissolved and 
particulate loads, see Figure 23.1), while the Lo component can be influenced by 
sewage treatment. As can be judged from the above equation, sewage treatment 
affects only the expectation of the load, while reservoirs affect both the 
expectation and the variance (for details see Somlyódy [6]). 

The planning type nutrient load model (NLMP) outlined briefly and the 
linear load response model (LEMP) lead to the affine relation (Somlyódy and 
Wets [9]) 

$$y(x,\omega) = T(\omega)x - h(\omega)$$

where $y = (y_1,\dots,y_4)$ are the water quality indicators in Basins 1,...,4, 
the random vector $h$ incorporates all noncontrollable factors, the x-variables 
are the control variables and the linear transformation $T(\omega)x$ gives the 
effect on water quality of the measures taken to control the loads L. 

In the formulation of the eutrophication management optimization model 
(EMOM) the objective must be chosen so as to measure in the most realistic 
fashion possible the deviations of the indicators from the water quality goals. 
This led us to a stochastic program with recourse model with associated solution 
procedure developed by Rockafellar and Wets [8] and implemented by King [1]. 
We also used a linear programming model, see Somlyédy [6] and Somlyédy and 
Wets [9] (Section 6) that is based on expectation-variance considerations (for 
the water quality indicators). In the Lake Balaton case study the results for 
both this expectation-variance model and the stochastic programming model 
(5.11) lead to remarkably similar investment decisions. Subsequently, objective 
functions and results of the two models are briefly discussed. 


1. The recourse formulation starts from the following considerations. The 
model should distinguish between situations that barely violate the desired 
water quality levels ($\gamma_i$, $i = 1,\dots,N$) and those that deviate 
substantially from these norms. This suggests a formulation of our objective 
in terms of a penalization that would take into account the observed values of 
$(y_i(x,\omega) - \gamma_i)$ for $i = 1,\dots,4$. 

We found that the following class of functions provided a flexible tool for 
the analysis of these factors. Let $\theta: R \to R_+$ be defined by 

$$\theta(\tau) = \begin{cases} 0 & \text{if } \tau \le 0, \\ \tfrac{1}{2}\tau^2 & \text{if } 0 \le \tau \le 1, \\ \tau - \tfrac{1}{2} & \text{if } \tau \ge 1. \end{cases}$$

This is a piecewise linear-quadratic-linear function. The penalty functions 
($\Psi_i$, $i = 1,\dots,N$) are defined through: 

$$\Psi_i(z_i) = q_i e_i\,\theta(e_i^{-1} z_i) \quad \text{for } i = 1,\dots,N,$$

where $q_i$ and $e_i$ are positive quantities that allow us to scale each 
function $\Psi_i$ in terms of slopes and the range of its quadratic component. 
By varying the parameters $e_i$ and $q_i$ we are able to model a wide range of 
preference relationships and study the stability of the solution under 
perturbation of these scaling parameters. 
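
The shape of this penalty can be checked numerically. The following is a minimal sketch, assuming the penalty is q·e·θ(v/e) with θ equal to 0 for negative arguments, quadratic on [0, 1] and linear with unit slope beyond 1; the function names `theta` and `penalty` are illustrative and not taken from the implementation cited in the text.

```python
def theta(tau: float) -> float:
    """Piecewise linear-quadratic-linear function: 0, tau^2/2, tau - 1/2."""
    if tau <= 0.0:
        return 0.0
    if tau <= 1.0:
        return 0.5 * tau * tau
    return tau - 0.5


def penalty(v: float, q: float, e: float) -> float:
    """Penalty q * e * theta(v / e) for a goal violation v.

    e sets the width of the quadratic range, q the slope beyond it.
    Both must be positive.
    """
    if q <= 0.0 or e <= 0.0:
        raise ValueError("q and e must be positive")
    return q * e * theta(v / e)


if __name__ == "__main__":
    # e = 5.0 and q = 10.0 are the values quoted with Figure 23.2.
    for v in (-2.0, 0.0, 2.5, 5.0, 8.0):
        print(f"violation {v:5.1f} -> penalty {penalty(v, q=10.0, e=5.0):7.2f}")
```

Small violations (below e) are penalized quadratically, while larger ones grow linearly with slope q, which is how the model separates "barely violating" from "substantially deviating" situations.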

The objective is thus to find a program that in the average minimizes the 
penalties associated with exceeding the desired concentration levels. This leads 
to the following formulation of the water quality management problem: 


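find $x \in R^n$ such that 

$$0 \le x_j \le r_j, \quad j = 1,\dots,n,$$

$$\sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1,\dots,m_1,$$

$$\sum_{j=1}^{n} t_{ij}(\omega) x_j - v_i(\omega) = h_i(\omega), \quad i = 1,\dots,m_2,$$

and 

$$z = \sum_{j=1}^{n} \Big( c_j x_j + \frac{d_j}{2 r_j} x_j^2 \Big) + E\Big\{ \sum_{i=1}^{m_2} q_i e_i\,\theta\big(e_i^{-1} v_i(\omega)\big) \Big\}$$

is minimized, 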
to which one refers as a quadratic stochastic program with simple recourse; here 
b; is the available budget that we handle as a parameter. For problems of 
this type, in fact with this application in mind, an algorithm is developed in 
Rockafellar and Wets [2], and Rockafellar and Wets [8], which relies on the 
properties of an associated dual problem. In particular it is shown that the 
following problem: 
find y€ RP and z(-):Q— R™? measurable such that 


0<2z(v) <a, ¢=1,...,mgq 


my mg 
Uz = ty — d. Oi Yi abe zi(w)tizj(w)}, g=l,....2 


and ae - SEC (w) 2¢(w) + aa? ()} 
i=1 i=1 
- 3 rjd;0(d; *u;) is maximized , 
j=l 
is dual to the original problem, provided that for $i = 1,\dots,m_2$, the $e_i$ 
and $q_i$ are positive (and that is the case here) and for $j = 1,\dots,n$, the 
$d_j > 0$, which is taken care of by a natural perturbation of the objective. 
An experimental version of this algorithm that relies on MINOS was 
implemented at IIASA by A. King (and is available through IIASA as part of a 
collection of codes for solving stochastic programs), see King [1]. It starts 
the procedure by solving the deterministic problem with expected values for 
the coefficients in h and T. 


2. As a starting point for the construction of the expectation-variance 
model, we consider the following objective function: 


$$\sum_{i=1}^{N} q_i\, E\big\{ (y_i(x,\cdot) - \gamma_i)_+^2 \big\},$$

where, as earlier, $y_i(x,\omega)$ is the water quality indicator characterized 
by the selected indicator in basin i given the investment program x and the 
environmental conditions $\omega$, $\gamma_i$ the goal set for basin i and $q_i$ 
a weighting factor. The objective being quadratic in the area of interest, and 
the distribution functions $G_i(x,\cdot)$ of the $y_i(x,\cdot)$ not being too 
far from normal, one should be able to recapture the essence of the effect of 
this objective function on the decision process by considering just 
expectations and variances of the $y_i(x,\cdot)$. This observation, and the 
“soft” character of the management problem, suggest that we could substitute 
for the original objective 


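$$\sum_{i=1}^{N} q_i \Big( E\{ y_i(x,\cdot) - \bar{y}_{0i} \} + \theta\,\sigma\big( y_i(x,\cdot) - \bar{y}_{0i} \big) \Big)$$

where $\theta$ is a positive scalar (usually between 1 and 2.5), 
$\bar{y}_{0i} = E\{y_{0i}\}$ is the expected nominal state of basin i, and 
$\sigma$ denotes standard deviation, 

$$\sigma^2\big( y_i(x,\cdot) - \bar{y}_{0i} \big) = E\big\{ [\, y_i(x,\cdot) - E\{ y_i(x,\cdot) \} \,]^2 \big\}.$$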


Since for each $i = 1,\dots,N$, the $y_i$ are affine (linear plus a constant 
term) with respect to x, the expression for 

$$E\{ y_i(x,\cdot) - \bar{y}_{0i} \} = \sum_{j=1}^{n} \mu_{ij} x_j + \mu_{i0}$$

as a function of x is easy to obtain from the load equations. The $\mu_{ij}$ 
are the expectations of the coefficients of the $x_j$ and the $\mu_{i0}$ the 
expectation of the constant term. Unfortunately the same does not hold for the 
standard deviation $\sigma(y_i(x,\cdot) - \bar{y}_{0i})$. The nutrient-load 
model suggests that 

$$\sigma\big( y_i(x,\cdot) - \bar{y}_{0i} \big) \approx \Big( \sum_{j=1}^{n} \sigma_{ij}^2 x_j^2 \Big)^{1/2}$$

where $\sigma_{ij}$ is the part of the standard deviation that can be 
influenced by the decision variable $x_j$; for example, the standard deviation 
of the tributary load. 
Cross terms are for all practical purposes irrelevant in this situation since 
the total load in basin i is essentially the sum of the loads generated by 
various sources that are independently controlled. This justifies using 

$$\sum_{i=1}^{N} q_i \Big[ \sum_{j=1}^{n} \mu_{ij} x_j + \theta \Big( \sum_{j=1}^{n} \sigma_{ij}^2 x_j^2 \Big)^{1/2} \Big]$$

as an objective for the optimization problem. This function is convex and 
differentiable on $R^n$ except at x = 0, and conceivably one could use a 
nonlinear programming package to solve the optimization problem: 


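find $x \in R^n$ such that 

$$r_j^- \le x_j \le r_j^+, \quad j = 1,\dots,n,$$

$$\sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1,\dots,m,$$

and 

$$z = \sum_{i=1}^{N} q_i \Big[ \sum_{j=1}^{n} \mu_{ij} x_j + \theta \Big( \sum_{j=1}^{n} \sigma_{ij}^2 x_j^2 \Big)^{1/2} \Big]$$

is minimized. 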


One can go one step further in simplifying the problem to be solved, namely 
by replacing the term 

$$\Big( \sum_{j=1}^{n} \sigma_{ij}^2 x_j^2 \Big)^{1/2}$$

in the objective by the linear (inner) approximation 

$$\sum_{j=1}^{n} \sigma_{ij} x_j.$$

On each axis of $R^n$, no error is introduced by relying on this linear 
approximation; otherwise we are over-estimating the effect a certain 
combination of the $x_j$'s will have on the variance of the concentration 
levels. Thus, at a given budget level we shall have a tendency to start 
projects that affect the variance more strongly if we use the linear 
approximation, and this is actually what we observed in practice. Assuming the 
cost functions $c_j$ are piecewise linear, we have to solve the linear program: 


find $x \in R^n$ such that 

$$r_j^- \le x_j \le r_j^+, \quad j = 1,\dots,n,$$

$$\sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1,\dots,m_1,$$

and 

$$z = \sum_{i=1}^{N} q_i \sum_{j=1}^{n} (\mu_{ij} + \theta \sigma_{ij})\, x_j$$

is minimized. 


We refer to this problem as the (linearized) expectation-variance model. 
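
The over-estimation introduced by the linearization is easy to see numerically. The sketch below compares the square-root term with its linear replacement for nonnegative decisions; the coefficient values and the function names are illustrative assumptions, not data from the Lake Balaton study.

```python
import math
from typing import Sequence


def exact_std(sigma: Sequence[float], x: Sequence[float]) -> float:
    """Square-root term: (sum_j sigma_j^2 x_j^2)^(1/2)."""
    return math.sqrt(sum((s * xj) ** 2 for s, xj in zip(sigma, x)))


def linear_std(sigma: Sequence[float], x: Sequence[float]) -> float:
    """Linear replacement: sum_j sigma_j x_j (for x_j >= 0)."""
    return sum(s * xj for s, xj in zip(sigma, x))


if __name__ == "__main__":
    sigma = [3.0, 4.0]  # hypothetical standard-deviation coefficients
    for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]):
        print(x, exact_std(sigma, x), linear_std(sigma, x))
```

On a coordinate axis the two expressions agree; as soon as two decisions are active together the linear form is larger, so the variance contribution is weighted more heavily than in the square-root form.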

We have given only a heuristic “justification” for the use of the 
expectation-variance model as a management tool. In Section 6 of Somlyódy and 
Wets [9], this model is also derived from a basic formulation of the 
management problem that integrates reliability and penalty considerations. 


3. Figures 23.2 and 23.3 give a comparison of the results for the recourse 
and the expectation-variance models when we vary $\beta$ (the budget level). 
Statistical parameters (expectation, standard deviation and extremes) of the 
water quality indicators obtained from a Monte Carlo procedure are illustrated 
in Figure 23.2 for the Keszthely basin as a function of the available budget 
$\beta$. 

In Figure 23.3, we record the changes in the two major control variables 
(tsni and xp) associated to the treatment plant of Zalaegerszeg and the (sec- 
ond) reed lake segment of the Kis-Balaton system (see Figure 23.1). There is a 
significant trade-off between these two variables. For decision making purposes, 
it is important to observe that there are four ranges of possible values of 
$\beta$ in which the solution has different characteristics. 


[Figure 23.2 (plot): $Y_1$ [mg/m³] versus the total annual cost TAC, showing 
max{$Y_1$}, the 95% confidence level, E{$Y_1$} and min{$Y_1$} for the 
expectation-variance model and for the stochastic model with recourse; 
parameters $\gamma_i$ = (48, 28, 24, 18), $e_i$ = 5.0, $q_i$ = 10.0, 
i = 1,...,4; B marks the basic case.] 

Figure 23.2. Water quality indicator (Chl-a)max as a function of the total 
annual cost. 

Figure 23.3. Change of major decision variables. 


As seen from Figures 23.2-23.3, the two models produce practically the same 
results in terms of the water quality indicators (including also their distribu- 
tion). With respect to details there are minor deviations. According to Figure 
23.3, the expectation-variance model gives more emphasis to fluctuations in 
water quality and consequently to reservoir projects, than the stochastic re- 
course model (see the basic case, B, with the parameters specified), and this is 
in accordance with the fact that the role of the variance is overstressed in the 
expectation-variance model. 

From this quick comparison of the performances of the two models, we 
may conclude that the more precise stochastic model validates the use of the 
expectation-variance model in the case of Lake Balaton. 

A more detailed analysis, a further discussion of the role of the parameters 
$\gamma_i$, $e_i$ and $q_i$, and a comparison between deterministic and 
stochastic models are given in Section 8 of Somlyódy and Wets [9]. 
CHAPTER 24 


OPTIMAL INVESTMENTS FOR ELECTRICITY 
GENERATION: A STOCHASTIC MODEL AND 
A TEST-PROBLEM 


F. V. Louveaux and Y. Smeers 


Introduction 


In this chapter, we study the problem of optimal investments for electricity 
generation. We discuss the reasons which justify the use of a multistage sto- 
chastic model and present a formulation for such a model. We then propose a 
two-stage test-problem derived from this model. 


24.1 The Problem 


Among the various problems related to electricity generation, we consider here 
the investment problem which consists in finding optimal levels of investment in 
various types of power plants so as to meet future demands, see Anderson [1]. 
Three properties of a given power plant i can be singled out in a static 
analysis: the investment cost $c_i$, the operating cost $q_i$ and the 
availability factor $a_i$, which indicates the percentage of time the power 
plant can effectively be operated. 
Demand for electricity can be considered to be a single product, but the level 
of demand varies over time. The electricity producers usually represent the 
demand in terms of a so-called “load-curve” which describes the demand over 
time in decreasing order of demand level (Figure 24.1). Since we are concerned 
here with investments over the long run, the load curve we consider is taken 
over a year. 


The load curve can be approximated by a piecewise constant curve (Figure 
24.2) with k segments. Let $d_1 = D_1$, $d_j = D_j - D_{j-1}$, $j = 2,\dots,k$ 
represent the additional power demanded in the so-called “mode j” for a 
duration $T_j$. Note that in order to obtain a good approximation of the load 
curve, it is necessary to consider large values of k. In the static situation, 
the problem consists in finding the optimal investment for each mode j, i.e. 
the one which minimizes the total cost of effectively producing 1 MW of 
electricity during the time $T_j$: 

$$i(j) = \operatorname*{argmin}_{i=1,\dots,n} \Big\{ \frac{c_i + q_i T_j}{a_i} \Big\} \tag{24.1}$$

where n is the number of available technologies. 
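
As a small illustration of rule (24.1), the sketch below picks, for a mode of duration T_j, the technology with the smallest cost (c_i + q_i T_j)/a_i. The numbers are illustrative assumptions (investment costs and per-unit-time operating costs loosely patterned on the test problem of Section 24.5, with all availability factors set to 1), and the function name is hypothetical.

```python
from typing import Sequence


def best_technology(T_j: float, c: Sequence[float], q: Sequence[float],
                    a: Sequence[float]) -> int:
    """Return the 0-based index i minimizing (c_i + q_i * T_j) / a_i."""
    costs = [(c_i + q_i * T_j) / a_i for c_i, q_i, a_i in zip(c, q, a)]
    return min(range(len(costs)), key=costs.__getitem__)


if __name__ == "__main__":
    c = [10.0, 7.0, 16.0, 6.0]   # investment costs (illustrative)
    q = [4.0, 4.5, 3.2, 5.5]     # operating costs per unit time (illustrative)
    a = [1.0, 1.0, 1.0, 1.0]     # availability factors (assumed equal to 1)
    for T_j in (10.0, 5.0, 0.5):
        print(f"T_j = {T_j:4.1f}: technology {best_technology(T_j, c, q, a) + 1}")
```

With these numbers the long-duration mode selects the technology with the lowest operating cost and the short-duration mode the one with the lowest investment cost, which is the base-load versus peak-load pattern the static model is meant to capture.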
[Figure 24.1: load curve, with axes Demand and Time.] 

[Figure 24.2: piecewise constant approximation of the load curve.] 


The above static model captures one essential feature of the problem, namely 
that base load demand (associated with large values of $T_j$, i.e. small 
indices j) is covered by equipment with low operating costs (scaled by the 
availability factor) while peak-load demand (associated with small values of 
$T_j$, i.e. large indices j) is covered by equipment with low investment costs 
(also scaled by the availability factor). For the sake of completeness, note 
that peak-load equipment should also offer enough flexibility in operations. 


24.2 A multistage model 


At least four elements justify considering a dynamic or multistage model for 
the electricity generation investment problem: 


- the long-term evolution of equipment costs 

- the long-term evolution of the load curve 

- the appearance of new technologies 

- the obsolescence of presently available equipment. 


The equipment costs are influenced by technological progress but also (and for 
some drastically) by the evolution of fuel costs. 

Of significant importance in the evolution of demand is the total energy 
demanded (the area under the load curve) but also the peak level $D_k$, which 
determines the total capacity that should be available to cover demand. The 
evolution of the load curve is driven by several factors including the level 
of activity in industry, energy savings in general as well as the electricity 
producers' tariff policy. 

The appearance of new technologies depends on the technical and com- 
mercial success of research and development while obsolescence of available 
equipment depends on past decisions and technical life time of equipment. 

All these elements together imply that it is no longer optimal to invest 
only in view of the short-term ordering of equipment given by (24.1) but that 
a long-term optimal policy should be found. 

The following multistage model can be proposed. Let 

  n = number of technologies available 
  $x_i^t$ = new capacity made available for technology i at time t 
  $s_i^t$ = total capacity of i available at time t 
  $a_i$ = availability factor of i 
  $L_i$ = life-time of i 
  $g_i^t$ = existing capacity of i at time t, decided before t = 1 
  $d_j^t$ = maximal power demanded in mode j at time t 
  $T_j^t$ = duration of mode j at time t 
  $y_{ij}^t$ = capacity of i effectively used at time t in mode j 
  $c_i^t$ = unit investment cost for i at time t (on a yearly equivalent basis) 
  $q_i^t$ = unit production cost for i at time t 


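The electricity generation N-stage problem is 

$$\min \sum_{t=1}^{N} \Big[ \sum_{i=1}^{n} c_i^t s_i^t + \sum_{i=1}^{n} \sum_{j=1}^{k} q_i^t T_j^t y_{ij}^t \Big] \tag{24.2}$$

subject to 

$$s_i^t = s_i^{t-1} + x_i^t - x_i^{t-L_i}, \quad i = 1,\dots,n, \; t = 1,\dots,N \tag{24.3}$$

$$\sum_{i=1}^{n} y_{ij}^t = d_j^t, \quad j = 1,\dots,k, \; t = 1,\dots,N \tag{24.4}$$

$$\sum_{j=1}^{k} y_{ij}^t \le a_i (g_i^t + s_i^t), \quad i = 1,\dots,n, \; t = 1,\dots,N \tag{24.5}$$

$$x, y, s \ge 0$$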


Decisions in each period t involve new capacities $x_i^t$ made available in 
each technology and capacities $y_{ij}^t$ operated in each mode for each 
technology. 

Newly decided capacities increase the total capacity $s_i^t$ made available, 
as given by (24.3), where account is also taken of equipment becoming obsolete 
after its lifetime. We assume $x_i^\tau = 0$ if $\tau \le 0$, so equation 
(24.3) only involves newly decided capacities. 

By (24.4), the optimal operation of equipment must be chosen in such a way as 
to meet demand in all modes, using available capacities which by (24.5) depend 
on capacities $g_i^t$ decided before t = 1, the cumulative newly decided 
capacities $s_i^t$ and the availability factor. 

The objective function (24.2) is the sum of the investment plus maintenance 
costs and operating costs. Compared to (24.1), availability factors are taken 
care of in constraints (24.5) and no longer need to appear in (24.2); the 
operating costs are exactly the same and are based on operating decisions 
$y_{ij}^t$, while the investment annuities and maintenance costs $c_i^t$ apply 
to the cumulative capacity $s_i^t$. Placing annuities on the cumulative 
capacity, instead of charging the full investment cost to the decision 
$x_i^t$, simplifies the treatment of end effects and is currently used in many 
power generation models. It is a special case of the salvage value approach, 
see e.g. Grinold [3]. 
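
The capacity balance (24.3) can be traced with a few lines of code. This is a minimal sketch for a single technology, assuming the cumulative capacity starts at zero before period 1 and that nothing was decided at or before period 0; the function name and the data are illustrative.

```python
from typing import List, Sequence


def total_capacity(x: Sequence[float], lifetime: int) -> List[float]:
    """Apply s^t = s^(t-1) + x^t - x^(t - lifetime) for t = 1..N.

    x[t - 1] is the new capacity made available in period t.
    Capacity decided in period t - lifetime is retired in period t;
    periods at or before 0 contribute nothing (assumption x^tau = 0).
    """
    s: List[float] = []
    previous = 0.0  # assumed s^0 = 0
    for t in range(1, len(x) + 1):
        retired = x[t - lifetime - 1] if t - lifetime >= 1 else 0.0
        previous = previous + x[t - 1] - retired
        s.append(previous)
    return s


if __name__ == "__main__":
    # Hypothetical investment plan over N = 5 periods, lifetime of 2 periods.
    print(total_capacity([2.0, 0.0, 1.0, 0.0, 3.0], lifetime=2))
```

The capacity added in period 1 drops out in period 3, which is the obsolescence effect the text describes.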


24.3 A stochastic model 


The same reasons that pleaded for the use of a multistage model can be advo- 
cated to motivate resorting to a stochastic model. The evolution of equipment 
costs, in particular fuel costs, the evolution of total demand, the date of ap- 
pearance of new technologies, even the lifetime of existing equipments can all 
be considered truly random. We first present a basic model taking the uncer- 
tainty about demand and costs into account leaving the other two aspects for 
the discussion. 

The main difference between the stochastic model and its deterministic case 
is in the definition of the variables $x_i^t$ and $s_i^t$. In particular, 
$x_i^t$ now represents the new capacity of i decided at time t, which becomes 
available at time $t + \Delta_i$, where $\Delta_i$ is the construction delay 
for equipment i. In other words, to have extra capacity available at time t, 
it is necessary to decide at $t - \Delta_i$, when less information is 
available on the evolution of demand and equipment costs. This is especially 
important since it would be preferable to be able to wait until the last 
moment to take decisions that would have immediate impact. 

Another consequence of delay factors and uncertainty is the fact that the 
model loses its relatively complete recourse property. This means that not 
every choice of investment decisions yields a feasible operations policy. To 
restore the relatively complete recourse property, it is necessary to assume 
that there exists a technology with high operating costs and zero construction 
delay. For any period t and any realisation $\xi$ of the random event, an 
investment is made in that technology, which for simplicity is always supposed 
to be technology n, if the level of capacity investments in the previous 
periods is insufficient to cover present demand. 

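Let 

  $x_i^t$ = new capacity decided at time t for equipment i, i = 1,...,n 
  $s_i^t$ = total capacity of i available plus in order at time t 
  n = a technology such that $\Delta_n = 0$ 
  $\xi^t$ = the random variable at time t 

and the other variables be as before. Then the stochastic model is 

$$\min E_\xi \sum_{t=1}^{N} \Big( \sum_{i=1}^{n} c_i^t s_i^t + \sum_{i=1}^{n} \sum_{j=1}^{k} q_i^t T_j^t y_{ij}^t \Big) \tag{24.6}$$

$$s_i^t = s_i^{t-1} + x_i^t - x_i^{t-L_i} \tag{24.7}$$

$$\sum_{i=1}^{n} y_{ij}^t = d_j^t \tag{24.8}$$

$$\sum_{j=1}^{k} y_{ij}^t \le a_i (g_i^t + s_i^{t-\Delta_i}) \tag{24.9}$$

$$a_n (g_n^t + s_n^{t-1} + x_n^t) \ge \sum_{j=1}^{k} d_j^t - \sum_{i=1}^{n-1} a_i (g_i^t + s_i^{t-\Delta_i}) \tag{24.10}$$

$$x, s, y \ge 0$$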


The elements forming $\xi^t$ are essentially the demands 
$(d_1^t,\dots,d_k^t)$ and the costs $(c^t, q^t)$. The decision vectors 
$(x^t, s^t, y^t)$ are conditional on the realizations $(\xi^1,\dots,\xi^t)$. 
The above model has fixed recourse since W and T are fixed and relatively 
complete recourse thanks to inequality (24.10). In most cases, when periods 
represent several years, typically five, and N is small enough, equation 
(24.7) can be simplified into 

$$s_i^t = s_i^{t-1} + x_i^t.$$
If one wants to consider the date of appearance of a new technology i to 
be a random event, the easiest way is to add constraints of the form 

$$x_i^t \le \eta_i^t u_i$$

where $u_i$ is a fixed upper bound on the investment in any period t, 
$\eta_i = (\eta_i^1,\dots,\eta_i^N)$ is a stochastic vector whose components 
are zero and one, and such that $\eta_i^{t+1} \ge \eta_i^t$, 
$t = 1,\dots,N-1$. This allows the model to retain a fixed recourse structure. 

On the other hand, if the availability factors or the life-time are random, 
then the model no longer possesses the fixed recourse property. 


24.4 Techniques of Solution 


Techniques of solution used in Louveaux [4] and Louveaux and Smeers [5] to 
solve (24.6)-(24.10) are based on two observations. First, an accurate 
approximation of the load curve by a piecewise constant curve, as was done in 
Section 24.1, requires the use of many different modes (k = 20 to 40, 
typically). As a result, the size of the model becomes very large. The 
alternative procedure proposed in [5] is to use a piecewise linear 
approximation such that a limited number of pieces suffice to adequately 
describe the load curve. Then, the objective function in (24.6) becomes 
quadratic in the $y_{ij}^t$'s. 

The second observation is that the above model possesses the 
block-separability property, discussed in [4]. This means that decisions on 
the operations variables $y_{ij}^t$, for a given $\xi^t$, can be taken 
independently of investment decisions $x^t$ of the same period for the same 
$\xi^t$, and moreover that the operations variables $y_{ij}^t$ do not 
influence in any way the choice of subsequent variables $x^\tau$ for 
$\tau \ge t+1$. The details of how to handle the special case of the 
technology n with zero construction delay, for which the decision $x_n^t$ can 
influence $y_{nj}^t$ for the same period, are explained in [4]. Using these 
techniques, problems running over 5 periods and having up to 32 final random 
realisations have been solved, see [5]. 


24.5 Test Problem 


In this section, we present a two-stage linear version of (24.6)—(24.10) with 
stochastic right-hand side only, and we discuss the reasons which make this test 
problem interesting. 
24.5.1 The example 

The example is a two-stage linear version of (24.6)-(24.10), with 3 operating 
modes, 4 technologies, a one period construction delay for all technologies, 
and no equipment available, so g = (0,0,0,0). We also assume $d_3 = 2$, 
$d_2 = 3$ and $d_1 = \xi$, where $\xi$ can take the value 3, 5 or 7 with 
probability 0.3, 0.4 and 0.3 respectively. Moreover $T_2 = 0.6\,T_1$ and 
$T_3 = 0.1\,T_1$; we assume $T_1 = 10$. Since N = 2 and all equipments have a 
one period construction lead time, (24.7) reduces to $s_i^1 = x_i^1$, so the 
variables s are suppressed from the formulation and the index t can be 
omitted. The constraint (24.10) takes the simple form 
$\sum_{i=1}^{4} x_i \ge 12$, where $12 = \max_\xi \xi + d_2 + d_3$. 

An upper bound is placed on the budget spent in the first period. The 
investment costs for the four equipments are (10, 7, 16, 6) respectively. With 
$T_1 = 10$, the operating costs in mode 1 are (40, 45, 32, 55). Then, with 
$T_2 = 6$ and $T_3 = 1$, one obtains the following model. 


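$$\begin{aligned} z = \min\; & 10x_1 + 7x_2 + 16x_3 + 6x_4 + E_\xi \min( 40y_{11} + 45y_{21} + 32y_{31} + 55y_{41} \\ & + 24y_{12} + 27y_{22} + 19.2y_{32} + 33y_{42} \\ & + 4y_{13} + 4.5y_{23} + 3.2y_{33} + 5.5y_{43} ) \end{aligned}$$

subject to 

$$\begin{aligned} x_1 + x_2 + x_3 + x_4 &\ge 12 \\ 10x_1 + 7x_2 + 16x_3 + 6x_4 &\le 120 \\ x &\ge 0 \\ y_{11} + y_{12} + y_{13} &\le x_1 \\ y_{21} + y_{22} + y_{23} &\le x_2 \\ y_{31} + y_{32} + y_{33} &\le x_3 \\ y_{41} + y_{42} + y_{43} &\le x_4 \\ y_{11} + y_{21} + y_{31} + y_{41} &\ge \xi \\ y_{12} + y_{22} + y_{32} + y_{42} &\ge 3 \\ y_{13} + y_{23} + y_{33} + y_{43} &\ge 2 \\ y &\ge 0 \end{aligned}$$

where $\xi$ can take the value 3, 5 or 7 with probability 0.3, 0.4 and 0.3 
respectively. 

The optimal solution is given by $x_1 = 8/3$; $x_2 = 4$; $x_3 = 10/3$; 
$x_4 = 2$ with objective value z = 381.853. It was obtained by using Birge's 
NDST3 program, see [2]. 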
24.5.2 Use of the example 

One great quality of the above example is that optimal second-stage decisions 
are easy to derive. This is an interesting feature for the design and verification 
of a new algorithm or computer code. The same property can also be used to 
illustrate the advantages of block-separability in multistage programs, see [4]. 

We now indicate how the second-stage decisions can be used to obtain one 
cut of the L-shaped method, see Van Slyke and Wets [6] and Birge [3] for a 
multistage version. 

In the above example, the optimal second-stage decisions, conditional on 
some realization of $\xi$, can be obtained by a simple rule, called the “order 
of merit rule”, which states that it is optimal to operate the equipments in 
the order of increasing operating costs. 

To illustrate this, take the example where $\xi = 5$ and $x_1 = 8/3$; 
$x_2 = 4$; $x_3 = 10/3$; $x_4 = 2$. Following the order of merit rule, the 
cheapest equipment, namely equipment No. 3, should be used first, i.e. in mode 
1, up to the available capacity; since $x_3 = 10/3 < d_1$, it follows that 
$y_{31} = x_3$ (this is valid as long as $x_3 \le 5$). 

The second cheapest equipment in terms of operating costs is equipment 
No. 1, hence $y_{11} = 5 - x_3$ and $y_{12} = x_1 - (5 - x_3) = x_1 + x_3 - 5$. 

In mode 2, in addition to equipment 1, it is necessary to operate No. 2 as 
follows: $y_{22} = 3 - (x_1 + x_3 - 5) = 8 - x_1 - x_3$ and finally 
$y_{23} = 2$. From this, we derive the value of the second stage for 
$\xi = 5$: 

$$Q(x, \xi = 5) = 32x_3 + 40(5 - x_3) + 24(x_1 + x_3 - 5) + 27(8 - x_1 - x_3) + 4.5 \cdot 2 = 305 - 3x_1 - 11x_3.$$


Similarly, for $\xi = 3$, one obtains the second-stage optimal solution 
$y_{31} = 3$, $y_{32} = x_3 - 3$, $y_{12} = x_1$, $y_{22} = 6 - x_1 - x_3$, and 
$y_{23} = 2$, hence the optimal value of the second stage 

$$Q(x, \xi = 3) = 209.4 - 3x_1 - 7.8x_3.$$

Finally, for $\xi = 7$, one optimal solution is 
$y_{31} = x_3$, $y_{11} = x_1$, $y_{21} = 7 - x_1 - x_3$, $y_{22} = 3$, 
$y_{23} = x_1 + x_2 + x_3 - 10$ and $y_{43} = 12 - x_1 - x_2 - x_3$, so 

$$Q(x, \xi = 7) = 417 - 6x_1 - x_2 - 14x_3.$$

Given the probabilities associated with $\xi$ = 3, 5, and 7, one obtains 

$$Q(x) = E_\xi Q(x, \xi) = 309.92 - 3.9x_1 - 0.3x_2 - 10.94x_3.$$

Hence, the related cut in the L-shaped method of Van Slyke and Wets [6] would 
be 

$$\theta \ge 309.92 - 3.9x_1 - 0.3x_2 - 10.94x_3.$$
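
The order of merit rule is simple enough to check the example by direct computation. The sketch below dispatches the equipment greedily, cheapest first and mode by mode, using exact fractions; it is an illustration of the rule as described in this section, not the NDST3 code, and all function names are assumptions. Evaluated at the reported optimal solution it reproduces z = 381.853, and it shows that the affine function 309.92 - 3.9 x1 - 0.3 x2 - 10.94 x3 coincides with the expected second-stage cost at that point.

```python
from fractions import Fraction as F

INVEST = [10, 7, 16, 6]                  # investment costs, technologies 1..4
OPER = [40, 45, 32, 55]                  # operating costs in mode 1
MODE_FACTOR = [F(1), F(6, 10), F(1, 10)]  # T_j / T_1 for modes 1, 2, 3
SCENARIOS = [(3, F(3, 10)), (5, F(4, 10)), (7, F(3, 10))]  # (xi, probability)


def second_stage(x, xi):
    """Operating cost under the order of merit rule for capacities x."""
    demand = [F(xi), F(3), F(2)]          # d_1 = xi, d_2 = 3, d_3 = 2
    left = [F(v) for v in x]              # capacity still unused
    order = sorted(range(4), key=lambda i: OPER[i])  # cheapest first
    cost = F(0)
    for j, d in enumerate(demand):
        for i in order:
            used = min(left[i], d)
            cost += OPER[i] * MODE_FACTOR[j] * used
            left[i] -= used
            d -= used
        if d > 0:
            raise ValueError("capacity is insufficient to cover demand")
    return cost


def expected_recourse(x):
    return sum(p * second_stage(x, xi) for xi, p in SCENARIOS)


def total_cost(x):
    return sum(c * F(v) for c, v in zip(INVEST, x)) + expected_recourse(x)


def cut(x):
    """Affine function 309.92 - 3.9 x1 - 0.3 x2 - 10.94 x3."""
    return F(30992, 100) - F(39, 10) * x[0] - F(3, 10) * x[1] - F(1094, 100) * x[2]


if __name__ == "__main__":
    x_opt = [F(8, 3), F(4), F(10, 3), F(2)]
    print("total cost:", float(total_cost(x_opt)))
    print("expected recourse:", float(expected_recourse(x_opt)))
    print("cut value:", float(cut(x_opt)))
```

Exact rational arithmetic is used so that the comparison between the cut and the expected recourse cost is not blurred by rounding.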
CHAPTER 25 


SOME APPLICATIONS OF STOCHASTIC OPTIMIZATION 
METHODS TO THE ELECTRIC POWER SYSTEM 


C. Nedeva 


25.1 Introduction 


The electricity generation, distribution and consumption network is a complex 
system involving a large number of power sources and consumers, electricity 
transmission lines, transformers, and so on. The technical and economic char- 
acteristics of its components depend on a number of factors: the amount of 
electricity consumed depends on the introduction of new consumers, on the use 
of new techniques, on the time of day, the season, etc; the volume of production 
and the price of electricity depend on the local hydrometeorological conditions, 
and on the quantities and prices of the available resources etc. 

Various types of problems arise in this system: forecasting problems, prob- 
lems of engineering design, exploitation problems, etc. Most of the resulting 
mathematical problems are problems of optimization under uncertainty, since 
it is usually impossible to predict precisely what will happen in the future. A 
typical design problem is described in the next section. 


25.2 Determination of the Optimal Parameters of a Superconducting 
Power Cable Line (S.C.P.C.L.) 


The aim is to minimize the total cost of the construction, exploitation and 
support of an S.C.P.C.L. We consider an S.C.P.C.L. of fixed construction with 
a coaxial disposition of the current-carrying and shielding superconductive 
elements. This type of construction allows us to express the total cost by 
means of the following parameters: the dimension x, the nominal tension y and 
the number z of cables in one line. The cost also depends on a number of 
factors whose values are determined theoretically or experimentally and are 
not known precisely. The exploitation costs of an S.C.P.C.L. depend on the 
transmitted power, the non-uniformity and amplitude of the graph of the load, 
the number of switch-offs, the temperature of the surroundings, and parameters 
whose values may vary during the operation of the S.C.P.C.L. The construction 
costs depend on the prices of the materials and the labor costs, which are not 
known precisely at the time of design; optimistic, pessimistic and most 
probable values are provided by expert economists. 
The parameters whose values are not known will be denoted by 
$(\omega_1,\dots,\omega_\ell) = \omega$ (in our problem $\ell \approx 30$), and 
we shall assume that $\omega$ is a random vector with distribution function H. 
Thus the total cost of the S.C.P.C.L. may be expressed by a very complicated 
function [1], indeed, so complicated that it is not feasible to discuss it 
here. We shall denote this function by $f(x,y,z,\omega)$ and record only its 
essential properties: f is measurable in $\omega$, and differentiable and 
strictly convex in x for every $y, z, \omega$. The variables y and z are 
discrete and may take a certain (not too large) number of values. The sets of 
feasible values of y and z will be denoted by Y and Z, respectively. 

For given $y \in Y$ and $z \in Z$, the S.C.P.C.L. should possess a 
steady-state stability margin, which is expressed by the condition 

$$\omega_{i_1} x \le c\,y^2 z,$$

where c is a known constant and $\omega_{i_1}$ is the component of $\omega$ 
corresponding to the transmitted power. 

Since the parameters x, y, z should be determined and fixed before the 
construction and exploitation of the S.C.P.C.L., a reasonable optimization 
criterion is the “minimization of the mean total cost” with the requirement 
that the steady-state stability condition is satisfied with sufficiently large 
probability $p_0$. 

We thus arrive at the following mathematical model: minimize the function 

$$F_H(x,y,z) = E f(x,y,z,\omega) = \int f(x,y,z,\omega)\,dH(\omega) \tag{25.1}$$

subject to 

$$P(\omega_{i_1} x \le c\,y^2 z) \ge p_0, \tag{25.2}$$

$$a \le x \le b, \quad y \in Y, \quad z \in Z. \tag{25.3}$$


Let us assume that the distribution function H is known. 

We therefore have to solve a partially-discrete stochastic programming 
problem. The number of feasible combinations of parameters y and z does 
not exceed 15, and for fixed $y \in Y$ and $z \in Z$, the minimization problem 
of the function (25.1) with respect to x subject to (25.2), (25.3) is easily 
solved by means of a method described below. Enumeration on the discrete 
parameters y and z is fully acceptable and we shall describe a method for the 
minimization with respect to x of the function (25.1) subject to (25.2), 
(25.3), for fixed $y = y^0$, $z = z^0$. We note first that conditions (25.2) 
and (25.3) together are equivalent to the condition 

$$x \in X^0 = \{ x \mid a \le x \le a^* \} \tag{25.4}$$

where $a^* = \max \alpha$ such that $a \le \alpha \le b$ and 

$$P\big( c\,(y^0)^2 z^0 / \omega_{i_1} < \alpha \big) \le 1 - p_0.$$

The distribution function of the random variable $\omega_{i_1}$ is known and 
the problem of finding the right bound $a^*$ is easily solved, for instance, 
by the golden section method. 

The problem (25.1), (25.4) may be solved by means of stochastic quasigradient
(SQG) methods. We choose an initial point $x^0 \in X^0$. Suppose we have
arrived at a point $x^k$ after $k$ iterations. Then we choose the point
$\omega^k$ in accordance with the distribution function $H$ and construct the
point

$$x^{k+1} = \max\bigl[a,\ \min\{a^*,\ x^k - \rho_k\,(f(x^k+\gamma, y_0, z_0, \omega^k) - f(x^k, y_0, z_0, \omega^k))\}\bigr],$$

where $\rho_k > 0$ is the step size and $\gamma > 0$ is a constant (in our
computations a reasonable choice for $\gamma$ turns out to be
$\gamma = 10^{-5}$).
As a stopping rule we use the inequality

$$\frac{1}{r}\,\Bigl|\sum_{s=k}^{k+r} f(x^s,y_0,z_0,\omega^s) - \sum_{s=k+r}^{k+2r} f(x^s,y_0,z_0,\omega^s)\Bigr| < 0.001.$$
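
The projected iteration above can be sketched in a few lines. This is a
minimal illustration under stated assumptions, not the authors' code: the
cost function is a toy quadratic, the finite difference is divided by
$\gamma$ so that it approximates a derivative, and the step-size rule
`c / (k + 1)` is chosen only for the example.

```python
import random


def sqg_minimize(f, sample, x0, lower, upper, gamma=1e-5, steps=5000, c=0.5):
    """Projected stochastic quasigradient iteration on [lower, upper]."""
    x = x0
    for k in range(steps):
        omega = sample()  # one observation of the random parameters
        # forward finite difference, same omega at both points (assumption)
        slope = (f(x + gamma, omega) - f(x, omega)) / gamma
        rho = c / (k + 1)  # assumed step-size rule
        x = max(lower, min(upper, x - rho * slope))  # projection onto X^0
    return x


rng = random.Random(0)
# Toy cost (x - omega)^2, whose expectation is minimized at E[omega] = 0.3.
x_opt = sqg_minimize(
    f=lambda x, w: (x - w) ** 2,
    sample=lambda: rng.uniform(0.2, 0.4),
    x0=0.9,
    lower=0.0,
    upper=1.0,
)
```

With this step-size rule the iterate behaves like a running average of the
observations, which is why it settles near the mean of the toy distribution.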


Optimal parameters were determined for many different sets of input data. The 
average number of iterations was 510. Table 25.1 gives the main parameters for 
one particular set of input data. 


Table 25.1 The value and type of the main parameters for one set of input
data

Parameter                                   Value(s)/distribution            Type
Length of the S.C.P.C.L.                    30 (km)                          Fixed
Feasible set for the number of cables
  in one line (Z)                           {2, 3, 4}                        Fixed
Feasible set for the nominal
  tension (Y)                               {10, 20, 40, 60, 90, 110} (kV)   Fixed
Transmitted power ($\omega_{12}$)           sharply normal in [400, 600],
                                            $E\omega_{12}$ = 500 (MVA)       Random
Required probability for the steady-state
  stability condition ($p_0$)               0.95                             Fixed


The optimal solution for the example given above was obtained as follows: 


- dimensional parameter $x^* = 0.146$
- number of cables in one line $z^* = 2$
- nominal tension $y^* = 60$ (kV)


The mathematical expectation of the total cost is (approximately): F = 628.6
(Lvs/m). The results obtained by this method were compared with some known
optimal solutions, and matched them quite closely.

When formulating the mathematical model we assumed that the distribution
functions of the random variables are known. However, an analysis of the
available information showed that these distributions can be determined only
within given classes of distribution functions characterized by some moments
or intervals for these moments. Thus, after the unknown parameters have been
determined and the values $x^*, y^*, z^*$ have been found, we have to consider
how the value of the objective function $F_H(x^*,y^*,z^*)$ changes when the
partially known distribution is varied within the given class of distribution
functions. This is, to some extent, a problem of sensitivity analysis with
respect to those parts of the distribution functions which are only partially
known.

In the problem under consideration, certain random variables, such as the
transmitted power, the temperature, etc., possess well-defined distribution
functions. Other random variables, such as material costs, the nonuniformity
and graph of the load, etc., have distribution functions that are only
partially known. Thus, the basic problem is to determine the bounds of the
objective function as the distribution functions vary between pessimistic and
optimistic estimates.

Such problems can be described in formal terms as follows. Let us denote by
$\eta \in \Omega$ the group of random variables whose distribution functions
are partially known and by $g^0(\eta)$ the value of the objective function as
a function of $\eta$ for fixed optimal parameters. The distribution function
$H$ of $\eta$ belongs to the class $K$, defined in the following way:

$$\int_\Omega g^i(\eta)\,dH(\eta) \le a_i, \quad i = 1,\dots,r,$$
$$\int_\Omega dH(\eta) = 1,$$

with given constants $a_i$, $i = 1,\dots,r$. In order to determine the range
of possible values of $G^0(H)$ as the distribution $H$ varies in $K$, we have
to solve the extremal problems

$$\min_{H \in K} G^0(H), \qquad (25.5)$$
$$\max_{H \in K} G^0(H), \qquad (25.6)$$

where

$$G^0(H) = \int_\Omega g^0(\eta)\,dH(\eta).$$


Numerical methods for solving such problems are given in [8]. Since the
extremal distributions themselves are not of importance, we can use the
so-called dual approach, where under rather general assumptions

$$\max_{H\in K} G^0(H) = \min_{u \in U^+}\Bigl\{\sum_{k=1}^r a_k u_k + \max_{v\in\Omega}\Bigl[g^0(v) - \sum_{k=1}^r u_k g^k(v)\Bigr]\Bigr\},$$

where

$$U^+ = \{u \in R^r \mid u_i \ge 0,\ i = 1,\dots,r\}.$$

An analogous proposition holds for problem (25.5). We can then use the
stochastic procedure described below (see [5]).

We start by choosing points $u^0 \in U^+$, $v^0 \in \Omega$, and suppose that
after $s$ iterations we have arrived at $(u^s, v^s)$. Then we generate a point
$\eta^s$ in accordance with the uniform measure on $\Omega$, and determine

$$v^{s+1} = \begin{cases} \eta^s, & \text{if } g^0(v^s) - \sum_{k=1}^r u_k^s g^k(v^s) < g^0(\eta^s) - \sum_{k=1}^r u_k^s g^k(\eta^s), \\ v^s & \text{otherwise;} \end{cases}$$

we then compute

$$u_i^{s+1} = \max\Bigl\{0,\ u_i^s - \frac{1}{s+1}\,(a_i - g^i(v^{s+1}))\Bigr\}, \quad i = 1,\dots,r.$$

When implementing this method we used the inequality

$$\Bigl|\frac{1}{r}\sum_{s=k}^{k+r}\sum_{i} u_i^s\,(a_i - g^i(v^{s+1}))\Bigr| < 0.01$$

as a stopping criterion. The computational results obtained by this method
showed a great variation in the degree to which the value of the objective
function depends on the distribution functions. This is demonstrated in Table
25.2 and Table 25.3.
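
The dual procedure can be illustrated on a one-dimensional toy problem. The
sketch below is written under explicit assumptions: $\Omega = [0, 1]$, a
single moment constraint, a step size of $1/s$ with the counter starting at
1, and hypothetical function names; it is not the implementation used for the
reported results.

```python
import random


def dual_bound(g0, g, a, sample, v0, steps=20000):
    """Stochastic estimate of max E g0 over H with E g[i] <= a[i]."""
    u = [0.0] * len(a)
    v = v0

    def inner(point):
        # bracketed term of the dual: g0(point) - sum_i u_i * g_i(point)
        return g0(point) - sum(ui * gi(point) for ui, gi in zip(u, g))

    for s in range(1, steps + 1):
        eta = sample()
        if inner(v) < inner(eta):  # random search step for the inner maximum
            v = eta
        # projected step on the multipliers with step size 1/s
        u = [max(0.0, ui - (ai - gi(v)) / s) for ui, ai, gi in zip(u, a, g)]
    value = sum(ai * ui for ai, ui in zip(a, u)) + inner(v)
    return value, u


rng = random.Random(1)
# Toy problem: maximize E[eta] subject to E[eta] <= 0.3 on [0, 1].
value, multipliers = dual_bound(
    g0=lambda t: t,
    g=[lambda t: t],
    a=[0.3],
    sample=lambda: rng.uniform(0.0, 1.0),
    v0=0.5,
)
```

In this toy problem the exact upper bound is 0.3 and the optimal multiplier
is 1, so the returned pair gives a quick sanity check of the iteration.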


Table 25.2 Some experimentally determined parameters.

Random variable                                Class $K_1$ of distribution functions
Critical density of the flow at
  temperature 3-4.2 K ($\omega_1$)             $\omega_1 \in [4\times10^{10},\ 8\times10^{10}]$
                                               $E\omega_1 \in [4.95\times10^{10},\ 5.05\times10^{10}]$
Relation of the expansion coefficient
  to the solidity coefficient
  (for the strengthening) ($\omega_2$)         $\omega_2 \in [0.90, 1.40]$
                                               $E\omega_2 \in [1.095, 1.105]$
Relation of the expansion coefficient
  to the solidity coefficient (for the
  system of shielding flow tubes)
  ($\omega_3$)                                 $\omega_3 \in [1.05, 1.25]$
                                               $E\omega_3 \in [1.195, 1.205]$


Table 25.3 The main economic parameters.

Meaning of the random variables                Class $K_2$ of distribution functions
Coefficient of the price of material
  (for the cold zone) ($\omega_{25}$)          $\omega_{25} \in [0.3375, 0.5625]$
                                               $E\omega_{25} \in [0.40, 0.50]$
Coefficient of the price of installation
  (for the cold zone) ($\omega_{26}$)          $\omega_{26} \in [0.135, 0.165]$
                                               $E\omega_{26} \in [0.145, 0.155]$
Coefficient of the price of material
  (for the cryogenic-covering)
  ($\omega_{27}$)                              $\omega_{27} \in [0.3375, 0.5625]$
                                               $E\omega_{27} \in [0.40, 0.50]$
Coefficient of the price of installation
  (for the cryogenic-covering)
  ($\omega_{28}$)                              $\omega_{28} \in [0.135, 0.165]$
                                               $E\omega_{28} \in [0.145, 0.155]$
Price of the refrigerator stations
  ($\omega_{29}$)                              $\omega_{29} \in [9, 11]$ (mln. lv.)
                                               $E\omega_{29} \in [9.7, 10.3]$


We divided the set of the partially known distributions into two subsets
$K_1$ and $K_2$. The minimal (maximal) value of the objective function with
respect to distributions of class $K_1$ was $F^{\min}$ = 627.
($F^{\max}$ = 642.7), and the result was obtained after 1243 (1406)
iterations. This result shows that the value of the objective function does
not depend strongly on the choice of distributions for these experimentally
determined parameters. Therefore their further specification is not
important.

The minimal value of the objective function with respect to distributions
of class $K_2$ is $F^{\min}$ = 623.9 (obtained after 2432 iterations), while
the maximal value is $F^{\max}$ = 5967.8. This result, which shows that the
value of the objective function depends strongly on the "pessimistic" bounds,
calls for additional input from experts.


25.3 Optimization of the Electricity Generating Stations


A specific feature of the above problem is the abundance of inexact input data. 
We shall make use of this characteristic in stochastic programming methods 
which we shall use to solve some exploitation problems in electricity generation. 

The problem can be briefly formulated as follows: determine the active
and the reactive powers of the electricity generating stations (the power is
usually expressed as a complex number $x = x' + I x''$, and $x'$ and $x''$
are called "active" and "reactive" power, respectively) so that the price of
the electric power produced is minimized subject to the following conditions:


- total production is equal to total consumption, 
- the resulting power flow is technically feasible. 


Let us denote the active power of consumer $i$ by $S_i'$ and its reactive
power by $S_i''$, $i = 1,\dots,p$, and suppose that they are random variables
with known distribution functions. For the stations we shall use $x_i'$ and
$x_i''$ to denote the active and the reactive powers, respectively, which
must lie in the intervals $[\alpha_i', \beta_i']$, $[\alpha_i'', \beta_i'']$,
$i = 1,\dots,q$. The cost of one unit of electrical power produced at the
station $i$ is $c_i$, $i = 1,\dots,q$. For every node $j$ (power station or
consumer) an interval $[\underline{u}_j, \overline{u}_j]$, $j = 1,\dots,n$,
$n = p+q$, for the voltage modulation is given. We shall take the active and
reactive powers of the stations as control variables. Other control variables
could include the transformation coefficients for some lines, the reactive
powers of certain consumers, etc.; these do not influence the basic structure
of the problem, but make its description more complicated.
We use the following mathematical model to determine the vector
$x' = (x_1',\dots,x_q')$: minimize

$$L(x') = \sum_{i=1}^{q} c_i x_i'$$

subject to

$$\sum_{i=1}^{q} x_i' = \sum_{i=1}^{p} S_i',$$
$$\alpha_i' \le x_i' \le \beta_i', \quad i = 1,\dots,q.$$


This simple linear programming problem possesses an explicit solution
$x' = x'(\sigma')$, where $\sigma' = \sum_{i=1}^{p} S_i'$. Even when a
quadratic objective function is used (instead of a linear one), the solution
may be simply expressed in terms of the values of the random variable
$\sigma'$.
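
The explicit solution of this linear program is the familiar merit-order
rule: every station runs at its lower limit and the remaining demand is
loaded onto the stations in order of increasing unit cost. A minimal sketch
follows; the function name and the unit costs are hypothetical, while the
station limits are those of the six-node example given later in this section.

```python
def dispatch(costs, lower, upper, demand):
    """Cheapest active powers x with sum(x) == demand and lower <= x <= upper."""
    if not sum(lower) <= demand <= sum(upper):
        raise ValueError("demand cannot be met within the station limits")
    x = [float(lo) for lo in lower]  # every station runs at least at its minimum
    remaining = demand - sum(x)
    # load the remaining demand onto stations in order of increasing unit cost
    for i in sorted(range(len(costs)), key=lambda i: costs[i]):
        extra = min(upper[i] - x[i], remaining)
        x[i] += extra
        remaining -= extra
    return x


# Station limits in MW from the example; the unit costs are hypothetical.
x_active = dispatch(costs=[3.0, 1.0, 2.0], lower=[0, 100, 100],
                    upper=[300, 500, 500], demand=600)
```

Because the solution depends on the consumers only through the total demand,
it can be tabulated once as a function of $\sigma'$ and reused for every
sampled load.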

The values of the reactive powers, the vector $x'' = (x_1'',\dots,x_q'')$,
have to be determined so that a technically feasible power flow exists. Let
us explain this condition. If the values of the active and the reactive
powers of all nodes are appropriately assigned, definite voltages $u_j$,
$j = 1,\dots,n$, arise in the nodes of the system. Mathematically this is
expressed by the fact that a nonlinear complex system, called the system of
nonlinear equations of the power flow, possesses a solution
$u = (u_1,\dots,u_n)$:

$$\sum_{j=1}^{n} a_{ij} u_j = \frac{S_i' + I S_i''}{u_i^*}, \quad i = 1,\dots,p,$$
$$\sum_{j=1}^{n} a_{p+i,j} u_j = \frac{x_i' + I x_i''}{u_{p+i}^*}, \quad i = 1,\dots,q.$$
Here $\{a_{ij}\}$, $i = 1,\dots,n$, $j = 1,\dots,n$, are the elements of the
admittance matrices, which also include complex constants; $I$ denotes the
imaginary unit and $u_i^*$ the complex conjugate of $u_i$. This system,
consisting of $2n$ equations for $2n$ unknowns, we shall denote by

$$h^i(x'', y, \omega) = 0, \quad i = 1,\dots,2n, \qquad (25.7)$$

where $\omega$ is a random vector including the consumers' powers and also
the vector $x'(\sigma')$, and $y = (y_1,\dots,y_n,y_{n+1},\dots,y_{2n})$ is
the vector composed of the components of the voltages of the nodes.

For fixed values of the components of the vector $\omega$ a vector $x''$ has
to be found such that the following condition is satisfied:

$$y \in Y = \{y \in R^{2n} \mid \underline{u}_j \le y_j \le \overline{u}_j,\ j = 1,\dots,n;\ 0 \le y_{n+j} \le 2\pi,\ j = 1,\dots,n\}.$$


Since this problem must be solved in real time, it is convenient to apply a
parametrization of the solution: the solution will be sought as an a priori
given vector function $x''(\omega) = x''(v,\omega)$, which depends on the
random vector $\omega$ and on the unknown vector $v \in R^m$ that has to be
determined. For convenience let us denote

$$h^i(x''(v,\omega), y, \omega) = g^i(v,y,\omega), \quad i = 1,\dots,2n,$$

$$f(v,y,\omega) = \max_{1 \le i \le 2n} |g^i(v,y,\omega)|.$$

We state the following problem for the vector $v$: minimize the function

$$F(v) = E \min_{y \in Y} f(v,y,\omega)$$

subject to

$$v \in V = \{v \in R^m \mid P(\alpha_i'' \le x_i''(v,\omega) \le \beta_i'') = 1,\ i = 1,\dots,q\}.$$


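Usually the functions $x''(v,\omega)$ are chosen as linear functions, for
example $x_i''(v,\omega) = v_i \sigma''$, $i = 1,\dots,q$, where
$\sigma'' = \sum_{i=1}^{p} S_i''$. In this case the set $V$ is a
parallelogram:

$$V = \{v \in R^q \mid \underline{v}_i = \alpha_i''/\max \sigma'' \le v_i \le \beta_i''/\min \sigma'' = \overline{v}_i,\ i = 1,\dots,q\}.$$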


Now we shall describe a numerical method for solving the problem with such a
parametrization. The method is of the stochastic $\varepsilon$-quasigradient
type (see [6]).

Let $v^0 \in V$ be an initial point and suppose that after $s$ iterations we
have arrived at $v^s$. Then we choose the observations $S_i'(s)$, $S_i''(s)$,
$i = 1,\dots,p$, in accordance with their distribution, compute the vector
$x'(\sigma_s')$ as a solution of the linear programming problem described
above for $\sigma' = \sigma_s' = \sum_{i=1}^{p} S_i'(s)$, and also compute
$\sigma_s'' = \sum_{i=1}^{p} S_i''(s)$. Thus we have determined the vector
$\omega^s$, composed of $S_i'(s)$, $S_i''(s)$, $i = 1,\dots,p$, and
$x'(\sigma_s')$.

Then we determine a vector $y^s = y^s(v^s,\omega^s)$ such that

$$f(v^s, y^s, \omega^s) \le \min_{y \in Y} f(v^s, y, \omega^s) + \varepsilon_s, \quad \varepsilon_s > 0,$$

and define

$$i_s = \arg\max_{1 \le i \le 2n} |g^i(v^s, y^s, \omega^s)|, \qquad \delta_s = \operatorname{sign}\, g^{i_s}(v^s, y^s, \omega^s).$$

We compute the point

$$v_i^{s+1} = \max\{\underline{v}_i,\ \min[\overline{v}_i,\ v_i^s - \rho_s \delta_s\, g_{v_i}^{i_s}(v^s, y^s, \omega^s)]\}, \quad i = 1,\dots,q,$$

where $\rho_s > 0$ is the stepsize and $g_v^{i_s}(v^s, y^s, \omega^s)$ is the
gradient of $g^{i_s}(v,y,\omega)$ with respect to $v$.

The determination of a vector $y^s = y^s(v^s,\omega^s)$ when $v^s$,
$\omega^s$ are given is a well-known problem in electroenergetics, the
so-called "problem of the power flow". For its approximate solution, numerous
methods of nondifferentiable optimization can be used. As a termination
criterion, the following inequality has been applied:

$$\frac{1}{r}\sum_{s=k}^{k+r} f(v^s, y^s, \omega^s) < 0.5.$$
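
To make the update rule concrete, here is a deliberately simplified sketch.
It assumes toy residuals $g^i(v,\omega) = v_i\sigma - t_i$ with hypothetical
targets $t_i$, drops the inner power-flow solve for $y$ entirely, and uses
the step size `c / s`; only the argmax, sign and projection steps mirror the
method described above.

```python
import random


def max_residual_sqg(targets, v_low, v_high, sample, steps=5000, c=0.5):
    """Projected quasigradient steps for F(v) = E max_i |v_i * sigma - t_i|."""
    v = list(v_low)
    for s in range(1, steps + 1):
        sigma = sample()
        residuals = [vi * sigma - ti for vi, ti in zip(v, targets)]
        # index of the largest absolute residual (the argmax in the text)
        i_s = max(range(len(v)), key=lambda i: abs(residuals[i]))
        delta = 1.0 if residuals[i_s] > 0 else -1.0  # sign of that residual
        # the gradient of residual i_s w.r.t. v is sigma in coordinate i_s only
        step = (c / s) * delta * sigma
        v[i_s] = max(v_low[i_s], min(v_high[i_s], v[i_s] - step))
    return v


rng = random.Random(0)
v_final = max_residual_sqg(
    targets=[0.4, 0.9],            # hypothetical targets t_i
    v_low=[0.0, 0.0],
    v_high=[2.0, 2.0],
    sample=lambda: rng.uniform(0.9, 1.1),
)
```

Only the coordinate that carries the largest residual moves at each step,
which is what makes the method a subgradient scheme for the max function.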


Example: This example illustrates the computational results for a network
with 6 nodes, 3 power stations and 3 consumers ($p = 3$, $q = 3$, $n = 6$).
The active and reactive powers of the consumers are supposed to be normally
distributed and may take values that differ by no more than 20% from their
mathematical expectations. As input data for the powers of the consumers we
give only the values of the expectations. The active and the reactive powers
of the power stations have to be determined (as described above) and
intervals for their values are given. Finally, for every node an interval for
the nominal tension of the voltage is given.


Input data for the nodes.

No.  Type      Active power       Reactive power      Nominal tension
1    consumer  300 (MW) ±20%      150 (MVAR) ±20%     [210, 230] (kV)
2    consumer  150 (MW) ±20%      100 (MVAR) ±20%     [205, 230] (kV)
3    consumer  150 (MW) ±20%      50 (MVAR) ±20%      [210, 230] (kV)
4    station   [0, 300] (MW)      [0, 200] (MVAR)     [390, 410] (kV)
5    station   [100, 500] (MW)    [0, 300] (MVAR)     [205, 230] (kV)
6    station   [100, 500] (MW)    [0, 300] (MVAR)     [390, 410] (kV)


The input data for the electro-transmission lines consist of the admittance
matrices of the lines. We shall not describe all these complex numbers and
only note that 8 branches (lines) are assumed.

The computational results were obtained after 109 iterations (18 sec. on the
ES-1040 computer). The parameters $v_1, v_2, v_3$ of the linear
parametrization

$$x_i'' = v_i \sigma'', \quad i = 1,2,3, \qquad \sigma'' = S_1'' + S_2'' + S_3'',$$

were determined as follows:

$$v_1 = 0.38, \quad v_2 = 0.89, \quad v_3 = 0.53.$$

The value of the objective function is

$$F(v) = 0.4805.$$

This result shows the average value of the maximal "nonbalance" in the system
of the power flow (25.7), when the reactive powers of the stations are chosen
in accordance with the parametrization law described above. Such a result is
fully satisfactory from the technical point of view.


CHAPTER 26


POWER GENERATION PLANNING WITH 
UNCERTAIN DEMAND 


O. Janssens de Bisthoven, P. Schuchewytsch and Y. Smeers 


Abstract 


We consider a multistage stochastic version of the power generation planning 
problem and present a solution technique for tackling it. The model can in- 
clude uncertainties in the cost and demand parameters as well as in the technol- 
ogy matrix; it embeds the classical LOLP reliability constraints. The solution 
method is a mixture of decomposition and cutting plane techniques. Because 
of the complexity of this type of problem compared to the more classical LP 
formulation, we provide a discussion of its practical relevance on the basis of a 
case study. 


26.1. Introduction 


Power generation planning consists of finding the mix of new production capac- 
ities that will satisfy the future electric demand at minimal investment and op- 
erations cost. The problem has given rise to many mathematical programming 
formulations that would be too long to recall here (see [1] for some references). 
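In its most usual form (see the classical paper by Anderson [2]) the model is
formulated as the following linear program

$$\text{minimize} \quad \sum_{t=1}^{T} (K_t y_t + c_t x_t) \qquad (26.1)$$

$$\text{subject to} \quad \sum_{\tau=1}^{t} A_{t\tau} y_\tau + B_t x_t = a_t, \quad t = 1,\dots,T,$$
$$C_t x_t = b_t, \quad t = 1,\dots,T,$$
$$\sum_{\tau=1}^{t} D_{t\tau} y_\tau \le d_t, \quad t = 1,\dots,T, \qquad (26.2)$$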

that we interpret as follows. $y$ and $x$ are respectively the vectors of
investment and operations variables. The objective function evaluates the
present value of the capacity expansion and exploitation costs over the
horizon. The first constraint provides a linkage between the operations and
investment variables; it expresses the fact that the exploitation is limited
by the existing capacities. The second constraint summarizes technical
restrictions on the operations of the plants (e.g. lack of flexibility of the
nuclear plants) and the satisfaction of the demand. Special attention must be
given to inequalities (26.2), which are introduced as surrogates of a
reliability criterion. In their most common form they express that the total
installed capacity must be larger than the peak demand plus some margin.

While models of the type (26.1)-(26.2) are usually sufficient for long term
scenario studies, some authors ([1], [8]) have introduced more refined tools
where the linear inequalities (26.2) are replaced by a true reliability
criterion

$$F_t(y_1,\dots,y_t) \le d_t \qquad (26.2')$$

which, in one of its common forms, expresses that the probability of not
being able to satisfy the peak demand cannot be larger than some amount
$d_t$. This criterion, usually referred to as the loss of load probability
(LOLP), makes the new model (26.1)-(26.2') considerably more difficult to
solve than its linearized counterpart (26.1)-(26.2). Other versions of the
problem which use slightly different reliability criteria (loss of energy
probability (LOEP)) are equally difficult. Benders' decomposition has been
proposed as a natural way to tackle these more complex problems.

We consider in this paper the treatment of a stochastic version of
(26.1)-(26.2) where uncertainties can appear in the cost coefficients, the
demand parameters and the technology matrix. Problems of this type are of
immediate interest these days, when parameters such as investment and fuel
costs, demand or availabilities of certain plants are typically uncertain.

This extended version of (26.1)—(26.2’) can be stated as a multistage sto- 
chastic program with recourse (see [4], [5]). In order to stick to the solution 
procedure adopted in this paper we shall immediately define the extensive form 
of the deterministic equivalent of the problem. 

Let $ET$ designate an event tree of depth $T$, $\Pi_i$ the probability of
node $i$ and $A(i)$ the set of its ancestors (including $i$ itself) in $ET$.
We consider the following multistage linear program

$$\text{minimize} \quad \sum_{i \in ET} \Pi_i (K_i y_i + c_i x_i) \qquad (26.3)$$

$$\text{subject to} \quad \sum_{j \in A(i)} A_{ij} y_j + B_i x_i = a_i, \quad i \in ET,$$
$$C_i x_i = b_i, \quad i \in ET,$$
$$F_i(y_j, j \in A(i)) \le d_i, \quad i \in ET. \qquad (26.4)$$
The model (26.3)-(26.4) will usually be quite large and hence difficult to
solve; it is handled in this paper by a mixture of decomposition and cutting
plane techniques which is discussed in Section 26.2. The current
implementation of the method is presented in Section 26.3. The last part of
the paper discusses the relevance of the approach compared to the more
classical deterministic models. This is done in the context of a study of the
commissioning of new nuclear capacities in Belgium in 1984.


26.2 Methodological Aspects


This section is devoted to an intuitive discussion of the method adopted for 
solving the problem (26.3)—(26.4). Our aim here is more to motivate the general 
approach than to provide a rigorous treatment of it (see [6] for an exposition 
and a convergence proof of the mixed decomposition/cutting plane algorithm 
used). Throughout the paper, the discussion will be illustrated with the help 
of the event tree given in Figure 26.1. 







Figure 26.1 Illustrative event tree 


Consider the linear programming problem consisting of the set of relations 
(26.3) only. It has a lower triangular block structure which in the case of our 
example is represented on Figure 26.2. 


Figure 26.2 Block structure of the matrix


Various algorithms exist for taking advantage of this property of the matrix.
In this paper we rely on the extension of decomposition [7] and nested
decomposition [8] proposed by Kallio and Porteus [9] for arborescent linear
programs. By definition the program

$$\min \sum_{\ell=1}^{N} c_\ell x_\ell$$

$$\sum_{\ell=1}^{N} B_{k\ell} x_\ell = b_k, \quad k = 1,\dots,N,$$

is arborescent if there exists an arborescence having nodes 1 to $N$ and such
that $B_{k\ell} \ne 0$ implies the existence of a directed path from $k$ to
$\ell$. As we shall see, both the primal and the dual of problem (26.3) can be
looked at as arborescent programs. The implementation of the method is
described in [10]; the following summary of the principle of the approach
will suffice for our purpose in this paper. We consider an arborescent matrix
as illustrated in Figure 26.3. Decomposition proceeds by breaking the
original model into a set of nested masters and subproblems according to the
structure of the matrix. Referring to Figure 26.3, the original problem,
noted 7, consists of a coupling block and two linked blocks noted 3 and 6;
each of the latter has the same structure as the original model, namely a
coupling constraint set and two linked matrices.


In the decomposition algorithm (see Figure 26.4) the global problem 7 will
be replaced by a master problem (noted 7) that will receive proposals from its
subproblems 3 and 6 and to which it will transfer prices. Because of the nested
block structure each of the subproblems can itself be replaced by a master
problem (also noted 3 and 6 respectively) which receives proposals from its
own subproblems (subproblems 1, 2 for the master 3 and subproblems 4, 5 for
the master 6) and returns price signals.


Particular cases of this general decomposition method arise when the ma- 
trix reduces to a single block angular structure ({7]) or, when each master only 
has a single subproblem ([8]). 

Figure 26.3 Nested block angular structure

Note: We have assumed that the tree describing the matrix is identical to the
event tree of Figure 26.1. This is by no means necessary but will help in
later discussions.


Arborescent linear programming can be applied in different ways to model
(26.3). Working directly on the primal problem, an exploitable structure is
the one indicated in Figure 26.5. It is easy to see that this corresponds to
taking advantage of only the multitemporal aspect of the problem. From the
point of view of the data structure, this implies that the size of the
subproblem in a given time period is determined by the number of nodes in
that period. This is admittedly embarrassing in a stochastic program where
the number of terminal nodes can quickly become large.


In contrast, working on the dual permits a much higher degree of decom- 
position. The structure of the dual matrix is given in Figure 26.6 which also 
shows its nested block angular structure. The size of the subproblem is then 
entirely determined by the size of each block in the matrix. This is a much 
more favorable situation and it is this structure that we shall exploit here. 


Figure 26.5 Nested block structure obtained by working on the primal problem

Figure 26.6 Matrix structure of the dual problem


We now turn to the handling of the reliability constraint. It is most common
in power generation planning to characterize a plant by its rated power $U$
and its availability factor $p$ (see [2]). Leaving aside, for the time being,
the fact that we are dealing with continuous capacity variables and not
multiples of the rated power, the loss of load probability in a node $i$ of
the event tree can be defined as follows. Let $\xi_i$ be the demand of
electricity in node $i$; $\xi_i$ is a random variable whose distribution is
entirely determined by the load duration curve. Let $S_i$ be the set of
plants existing in node $i$ and $\{y_j, j \in A(i)\}$ the vector of installed
capacities (that we take as integer variables in the course of this
discussion). We can define for each plant $s$ the random variable $\eta_{si}$
equal to the available capacity of plant type $s$ in node $i$. The
reliability criterion in node $i$ can then be written as

$$\Pr\Bigl[\sum_{s \in S_i} \eta_{si} \le \xi_i \Bigm| U_s,\ (y_j, j \in A(i))\Bigr] \le d_i \qquad (26.5)$$


which is a chance constraint. Besides very special cases, it is impossible to
write a deterministic equivalent of (26.5) which is, in any case, already
difficult to evaluate numerically (see [11], [12], [13] for examples of
numerical methods). The inclusion of reliability constraints in planning
models has mainly been done through Benders' decomposition ([2], [8]); we
shall follow a similar approach but reason instead in terms of cutting planes.
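
For a small number of units the probability in (26.5) can be evaluated by
direct convolution. The sketch below assumes independent two-state units
(fully available with probability $p$, otherwise unavailable) and a discrete
load distribution; the function names and data are hypothetical, and this
brute-force evaluation is not one of the numerical methods cited above.

```python
def capacity_distribution(units):
    """Distribution of total available capacity for independent two-state units.

    units is a list of (rated_power, availability) pairs.
    """
    dist = {0.0: 1.0}
    for power, avail in units:
        new = {}
        for cap, prob in dist.items():
            # unit available: capacity grows by its rated power
            new[cap + power] = new.get(cap + power, 0.0) + prob * avail
            # unit on outage: capacity unchanged
            new[cap] = new.get(cap, 0.0) + prob * (1.0 - avail)
        dist = new
    return dist


def lolp(units, demand):
    """P(available capacity <= load); demand is a list of (load, probability)."""
    dist = capacity_distribution(units)
    return sum(p_cap * p_load
               for cap, p_cap in dist.items()
               for load, p_load in demand
               if cap <= load)


# Two 100 MW units with availability 0.9 facing a fixed 150 MW load.
risk = lolp([(100.0, 0.9), (100.0, 0.9)], [(150.0, 1.0)])
```

In this example the load is lost unless both units are available, so the
value is 1 - 0.9 * 0.9 = 0.19.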

Let $(\Delta_i, i \in ET)$ be the vector of (exogenously determined)
capacities to be scrapped. Starting with the solution $(y_i, i \in ET)$ of
problem (26.3), that is without reliability constraints, one can define the
available capacity $z_i$ at node $i$ as

$$z_i = \sum_{j \in A(i)} y_j - \sum_{j \in A(i)} \Delta_j.$$

Strictly speaking, the loss of load probability is only defined for values of
$z_i$ that are multiples of the rated capacities; let $[z_i]$ be the vector
derived from $z_i$ by rounding down the capacities to multiples of the
commercial powers and $\delta_{is}$ be defined as

$$\delta_{is} = \frac{F_i([z_i]) - F_i([z_i] + e_s U_s)}{U_s},$$

where $e_s$ is the $s$-th unit vector. $\delta_{is}$ can be seen as the
decrease of the loss of load probability resulting from a unitary investment
in plant $s$. If the reliability criterion is not satisfied at $[z_i]$ we add
the constraint

$$F_i([z_i]) - \sum_{s \in S_i} \delta_{is}\,(z_{is} - [z_{is}]) \le d_i. \qquad (26.6)$$

This amounts to replacing the initial reliability constraint by some inner
linearization.
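
The construction of one such cut can be sketched as follows. The function
name, the stand-in reliability function `F_toy` and the numbers are
assumptions for illustration; the sketch only carries out the finite
differences and rearranges (26.6) into a linear inequality in the capacities.

```python
def reliability_cut(F, z_rounded, unit_sizes, d):
    """Coefficients of the linearized constraint (26.6) built at [z_i].

    Returns (delta, rhs) so that the cut reads sum_s delta[s] * z[s] >= rhs.
    """
    base = F(z_rounded)
    delta = []
    for s, U in enumerate(unit_sizes):
        bumped = list(z_rounded)
        bumped[s] += U  # one more unit of plant type s
        delta.append((base - F(bumped)) / U)
    # F([z]) - sum_s delta_s * (z_s - [z_s]) <= d, rearranged for the unknown z
    rhs = base - d + sum(ds * zs for ds, zs in zip(delta, z_rounded))
    return delta, rhs


def F_toy(z):
    """Hypothetical stand-in for the LOLP: halves for every 100 MW installed."""
    return 0.5 ** (sum(z) / 100.0)


z0 = [200.0, 100.0]
delta, rhs = reliability_cut(F_toy, z0, unit_sizes=[100.0, 50.0], d=0.1)
violated = sum(ds * zs for ds, zs in zip(delta, z0)) < rhs
```

The current point violates the new inequality exactly when the reliability
level exceeds $d_i$ there, which is the situation in which a cut is added.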

Because the storage of a linear program and its manipulation by the revised 
simplex method are essentially column oriented, the addition of a cut is not a 
natural operation in most commercial codes. This is a fortiori so if the solution 
technique is based on column generation such as in decomposition or nested 
decomposition. In contrast, the addition of a cut in the primal becomes the 
addition of a column when working in the dual and can thus be nicely inserted 
in the decomposition algorithm. The implementation of the combination of 
these techniques is discussed in the following section. 


26.3 Implementation 


Stochastic optimization although introduced at the very beginning of mathe- 
matical programming does not seem to be in widespread use. This may be due 
not only to the lack of specialized codes capable of dealing with these problems 
but also to the fact that stochastic models seem, at least in our experience, more 
difficult to formulate (event trees are more complex to arrive at than scenarios) 
and to generate (commercial matrix generators such as OMNI [14] do not per- 
mit easy manipulation of trees). It thus seems essential in order to implement 
the approach discussed in the preceding section to leave the maximal possible 
freedom to the user and in particular to refrain from imposing on him
constraints originating from the solution procedure. The following approach
has thus been adopted. In a first stage the user writes the extensive form of
his model in the
MPS format using standard matrix generation techniques. A program trans- 
forms this version of the primal model into an MPS representation of the dual. 
A third program rearranges the input of the dual in a form suitable for the de- 
composition code. The fourth stage is the optimization itself; the last one, the 
report writer, is essentially missing in the current implementation but should 
be developed in the future. We briefly review these different stages. 


26.3.1 Problem generation 


While standards exist for defining two stage stochastic programs [15], the case 
of multistage models remains largely untouched. We have assumed in this work 
that the modeler directly constructs the extensive form of the deterministic 
equivalent of his problem in MPS format using a commercial matrix generator. 
We allow him the most general formulation of a linear programming problem,
namely

$$\min\ c^T x$$
$$r \le Ax \le b \qquad (26.7)$$
$$\ell \le x \le u$$

which contains ranges on the constraints and bounds on the variables. In
order to allow for subsequent treatment, it is required that row and column
names corresponding to a given node have as their two last characters the
identifier of the node; the current implementation supposes that the nodes of
the tree are numbered in postorder; this constraint can however be relaxed
easily.
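
The naming convention makes it straightforward to recover the node of each
row or column from its name. A minimal sketch, with hypothetical names and a
hypothetical helper function:

```python
from collections import defaultdict


def group_by_node(names):
    """Group MPS row or column names by their two-character node suffix."""
    groups = defaultdict(list)
    for name in names:
        groups[name[-2:]].append(name)
    return dict(groups)


# Hypothetical names: INV and OPER variables for nodes 01 and 02.
blocks = group_by_node(["INV01", "OPER01", "INV02", "OPER02"])
```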


26.3.2 Construction of the dual problem 
The dual of (26.7) is written as

$$\text{minimize} \quad -b^T y - r^T z - u^T v - \ell^T w$$
$$\text{subject to} \quad A^T y + A^T z + v + w = c \qquad (26.8)$$
$$y, v \le 0$$
$$z, w \ge 0$$

and is constructed automatically from the MPS input file of the primal problem
(MAGENOUT file in the OMNI system). The formulation is rather unusual
to the extent that it involves nonpositive variables ($y$ and $v$) as well as
the more common nonnegative variables ($z$ and $w$). It can be justified as
follows: numerical elements in MPS format files are represented in twelve
character fields, one character being used for specifying the sign of the
number. Our version of the dual problem can be defined through an MPS file
which contains the same numerical elements as those of the primal problem and
hence does not require any change of sign (no change of sign is required in
the constraints and the minus signs of the objective function can be
generated using the facilities of the MPS software); this permits keeping
$b$, $r$, $u$ and $\ell$ with their original sign in the dual MPS file.
Besides the time gained by not having to change sign, this construction leads
to a dual which is numerically fully equivalent to the original problem.


26.3.3 Rearrangement of the MPS input file


This rearrangement is specific to the decomposition code used. It is discussed 
in detail in [10]. 


26.3.4 Optimization 

The main features of the decomposition code are discussed in [10] and will not 
be recalled here. The interaction between this code and the reliability criterion 
is represented on Figure 26.7. This part of the implementation is currently far 
from optimal; the following discussion will help clarify the issue. 

The decomposition code is fed with the reorganized MPS file of the dual
problem (see section above). Because decomposition methods provide a feasible
dual solution every time a cycle with a bounded subproblem is completed, it is
possible to extract the dual solution when convergence has almost been reached
and to evaluate the corresponding capacities in each node of the event tree. The
reliability criterion is then evaluated everywhere, or for a subset of the
nodes where the user feels that the LOLP is most likely to be violated;
additional cuts are generated when necessary.

The approach, although simple in principle, presents several challenging
features that have not yet been handled satisfactorily. We briefly report on
these in the following.

Figure 26.7 Interaction between the decomposition code and the reliability
criterion


Computation of the reliability criterion 


The evaluation of the loss of load probability is a costly operation and it
is out of the question to restart it from scratch at each evaluation of the
reliability criterion. The cumulant method introduced in [12] and [13]
provides an elegant solution to this problem. The cumulants (see [16] for the
definition of this notion) of the different plants and of the load duration
curves at each node $i$ can be computed once and for all at the outset of the
study; the cost of evaluating the reliability criterion is then drastically
reduced afterwards.


Insertion of reliability cuts

As mentioned above, a cut becomes a column in the dual and can be added
relatively easily in the input of the dual. Some elementary restart procedures
have been included in the current decomposition code which allow one not to go
through the whole optimization from scratch. Although these cuts could ideally
be directly added to the internal representation of the different subproblems
as new columns, this approach has not yet been explored and the cuts are now
included through the MPS file.


Report generation 

The question of the connection of the decomposition code with a commercial 
report writer has not been explored yet. At this stage the dual variables of 
the subproblems (the primal variables of the original problem) are directly 
extracted from the internal representation of the solution of the subproblems 
and constitute the output. This implies that the report writer must be written 
in a general purpose high level language (such as PL/1 with MPSX/370). 


26.4 A Case Study 


This machinery is rather complex, at least compared to the direct use of a 
commercial linear programming code. It is thus important, before resorting to 
the stochastic programming approach, to evaluate the additional insight that 
it can bring into the decision process. The following discussion is taken from 
a study of the commissioning of new nuclear plants in Belgium in early 1984. 
The general decision context is discussed in [17]; we focus here more on the 
numerical results. Consider the event tree of Figure 26.8 where the probabilities 
of the different scenarios are indicated at the right of the corresponding terminal 
nodes. 


[Event tree with three terminal scenarios:]
  - low energy price and high demand growth rate (2.7%)
  - high energy price and moderate demand growth rate (2.3%)
  - collapse of the steel industry and stagnant electricity demand (0%)


Figure 26.8 Event tree relative to the commissioning of new nuclear plants
in Belgium in 1984


The tree has been constructed by a governmental agency and taken as such. It
models a process where investment decisions must be taken in 1984 and 1985
without knowing the future demand. The latter is supposed to be revealed
in 1985, and later decisions are taken with perfect foresight from that period
on. Relevant to the use of stochastic programming is the existence of two
relatively similar scenarios (2.7 and 2.3%) together with a more contrasted one
(0%). Dropping the last scenario would probably make the stochastic model
useless; having more contrast between the first two evolutions would increase
its interest.

The discussion will focus on the size of the model and the impact of both the 
uncertain demand growth rate and the reliability criterion. We shall conclude 
by some quantitative evaluation of the stochastic programming approach and 
comments on its implications for policy analysis. 


Size of the model 


A first criticism against the use of the preceding machinery arises from the
present capabilities of commercial codes. It can indeed be claimed that, given
the existing possibilities of these codes, it is simply not reasonable to set up
models that require more computational resources. In order to assess this
argument, consider Table 26.1, which reports the capacities of nuclear plants
coming on line in 1993 and 1994 with deterministic models where the horizon is
limited to 1995 and 2000 respectively.


Table 26.1 1993 and 1994 nuclear capacities with deterministic models of
different horizons (capacities coming on line, in MW)


                       1993                            1994
Scenario    horizon 1995   horizon 2000    horizon 1995   horizon 2000
2.7%             508           1972             575            442
2.3%            1065           1681             405            412


Both versions of the model deal with end effects by assuming that the salvage
value of a plant at the end of the horizon is equal to the discounted sum, over
the rest of its technical life, of the annual values of the investment cost.
The difference between the results clearly points to the importance of
resorting to the longer term horizon model. A stochastic version of the problem
limited to a 1996 horizon has about 9000 constraints. It certainly challenges
the possibilities of commercial codes such as MPSX/370 but can be handled by
them. The 2000 horizon model has about 20,000 constraints and cannot be handled
by MPSX/370. The longer term horizon model appears necessary, but commercial
software would find it difficult or impossible to solve its stochastic version.


Uncertain demand growth rate and reliability constraint


The handling of the reliability criterion certainly adds to the complexity of
the methodology. Because LOLP constraints are rarely treated explicitly in long
term power planning models, one may question their usefulness in this more
complex setup. Table 26.2 reports the results of a pure scenario analysis and
of the stochastic model when the horizon is limited to 1996 (the shortened
version has been selected to reduce computer costs), with and without
accounting for the LOLP constraints. Perfect foresight induces the immediate
commissioning of new nuclear plants in the scenario approach, which results in
the satisfaction of the LOLP criterion (except in the 2.7% case where gas
turbines are required for reliability purposes). In contrast, in the stochastic
model the uncertainty about the growth rate first postpones investment
decisions, which nevertheless remain of the nuclear type. The resulting
generation system, however, violates the reliability constraints in 1994 and
1995. The role of the LOLP constraint appears in the last three rows of
Table 26.2. While gas turbines again come on line as early as 1990 in the 2.7%
scenario, coal fired plants are introduced for reliability purposes in 1994 and
1995. This new effect justifies considering the LOLP constraint in the
stochastic model.


Table 26.2 Comparison of the scenario and stochastic approaches under
different LOLP constraints (capacities coming on line, in MW)


                      1990   1993   1994   1995   Remarks

Scenario      2.7%     -      508    575    -     All investments are nuclear.
              2.3%     -     1065    405    -     Gas turbines are introduced in
              0%       -      -      -      -     1990 with the 2.7% scenario.

Stochastic    2.7%     -      -      -      856   All investments are nuclear
without       2.3%     -      -      -     1170   plants.
LOLP          0%       -      -      -      -
117. 330 =©938 coal 
2.7% 
155 : - - gas turbine 
Stochastic 
with - - - 1042 nuclear 
LOLP = 2.3% 
- 735 171 - coal 


0% : . : 
Valuation of the stochastic programming approach


We now consider the 2001 horizon model and evaluate two criteria usually found
in the literature in relation to stochastic models. The value of information
[18] compares the expected cost obtained in the deterministic scenario studies
with the objective function value of the stochastic model. It corresponds to
the value of a perfect forecast. The value of the stochastic solution [18]
evaluates the gain brought about by acting according to the solution of the
stochastic program. Supposing a certain behaviour of the decision maker (for
instance selecting a mean value approach in the first periods), it compares the
cost resulting from that behaviour with the one associated with the solution of
the stochastic program. Taking the value of information first, one finds that
the average cost of the scenario models amounts to 7 656 × 10^6 $ (of year
1982) while the cost of the stochastic programming model is 7 714 × 10^6 $.
Although this may look like a negligible difference in percentage terms, it is
certainly important when considered in marginal terms. Because the generation
system remains basically unchanged until 1994 (we can neglect the additional
gas turbine capacities of the 2.7% scenario, which are only introduced for
reliability purposes and are not exploited), the cost difference must be
attributed to the eight years of the period 1994-2001 which, after proper
discounting operations, amounts to 25.6 × 10^6 $/year.

The situation is more striking for the value of the stochastic solution.
Taking the average of the deterministic solutions as the initial decision, we
end up with an infeasible stochastic program. This corresponds to an infinite
value of the stochastic solution. This result can be explained as follows: two
policy constraints are implemented in the zero growth scenario which have to do
with particular features of the Belgian situation. One requires an additional
consumption of national coal in case of the collapse of the steel industry; the
other imposes a minimum level of operations on the new nuclear plants.
Admittedly these constraints have little economic sense; they have however a
lot of political relevance and formalize concerns often expressed in public
opinion. Together they render the operations of the power sector in 1990
infeasible in the 0% growth scenario with the investments resulting from the
mean value approach. This is admittedly an extreme case (which does not appear
in the 1996 horizon model); it nevertheless shows the utility of the stochastic
programming approach with respect to the more classical scenario approach.


Policy implication 

The commissioning of new nuclear plants in Belgium was delayed from 1981 to
1984, when a small participation in a French station (~ 450 MW) was decided.
The discussions during those three years mainly concentrated on demand
forecasts and on whether, because of the current uncertainties, one should
defer any immediate decision. The scenario approach, with its first stage
decision depending drastically on the assumptions, has been relatively
difficult to use in that context. In contrast, the stochastic programming
approach, because it deals with the whole set of scenarios at once, answers the
question of whether it is better to wait until additional information is
available.


26.5 Conclusion


The present uncertainties that pervade the economic environment of the
utilities make the sole use of classical deterministic power generation
planning models difficult to justify. In particular the scenario approach,
whatever its usefulness for exploring the impact of uncertainties on present
decisions, can prove useless when the solutions differ too much across equally
plausible scenarios. Stochastic programming has long been proposed as a natural
way to tackle the problem. We present an implementation of the approach and
show that it is both computationally feasible and practically relevant.


CHAPTER 27


EXHAUSTIBLE RESOURCE MODELS WITH UNCERTAIN 
RETURNS FROM EXPLORATION INVESTMENT 


J.R. Birge 


Abstract 


Exhaustible resource models that do not consider exploration investment have 
typically low values of perfect information and sometimes even optimal myopic 
policies. In this paper, we add exploration and capacity investment and allow 
the returns from exploration to be stochastic. We show that, in this model, the 
stochastic program solution may be quite valuable and that myopic policies are 
far from optimal. 


27.1 Introduction 


Exhaustible resource models have been studied by a number of authors. Hotel- 
ling [8] initially formulated a model that demonstrated that the market price 
of an exhaustible resource grows exponentially as it is depleted. Nordhaus 
[7] introduced the idea of a “backstop” technology to this model. The result 
is the Hotelling-Nordhaus model in which a finite resource is used until its 
production cost exceeds that of the inexhaustible backstop technology. The 
backstop technology is then introduced and the two technologies are never used 
simultaneously. 

Manne [5] and Manne and Richels [6] use the Hotelling-Nordhaus model 
in their analysis of the effect of the uncertainty of the introduction date of the 
fast breeder reactor. They formulate a stochastic linear program and solve it 
to find the expected value of perfect information (EVPI). Their results indicate 
that the expected value of perfect information in this model is low and that, 
therefore, deterministic problem solutions provide close approximations to the 
solution of the stochastic problem. 

Chao [2] presents an analytical justification for the observations of Manne 
and Richels. He formulates a mathematical program for the Hotelling-Nordhaus 
model. Under certain assumptions that include a demand that is independent 
of price, Chao shows that a myopic policy of using the most inexpensive avail- 
able technology first is optimal. He also introduces a price responsive demand 
function to his model and again shows that the EVPI is low. 

In this paper, we expand upon Chao's model by allowing exploration investment
that could yield additional resource supplies. The amount of increase in the
supply per unit of investment is however uncertain. We show that the EVPI and
the value of the stochastic solution (VSS) (Birge [1]) can be large when this
type of uncertainty is included. We give examples illustrating these
observations.


27.2 The Basic Model 


Our results concern two measures of the effect of uncertainty in stochastic
programs, the expected value of perfect information and the value of the
stochastic solution. We present these measures in the context of two-stage
stochastic programs with recourse. We first formulate the deterministic program

\[
\begin{aligned}
\text{minimize}\quad & \varphi(x,\xi) = cx + \min\,[\,qy \mid Wy = \xi + Tx,\; y \ge 0\,] \\
\text{subject to}\quad & Ax = b,\; x \ge 0,
\end{aligned}
\tag{27.1}
\]

where the vectors $c \in \mathbb{R}^{n_1}$, $q \in \mathbb{R}^{n_2}$, and
$b \in \mathbb{R}^{m_1}$ are known, the $m_2$-vector $\xi$ is a random vector
defined on the probability space $(\Xi, \mathcal{F}, F)$, and $A$, $W$, and $T$
are correspondingly dimensioned known real-valued matrices. A decision vector
$\bar{x}(\xi)$ obtained in Program 27.1 represents an optimal first period
decision given a realization $\xi$ of the random vector.

If an optimal first period decision is taken for all possible realizations of
the random vector, then we obtain in expected value the "wait-and-see" (WS)
solution value (Madansky [4]), where

\[
\mathrm{WS} = E_\xi\big[\min_x \varphi(x,\xi)\big].
\]

The stochastic program with recourse (Wets [8]) involves optimizing after
taking the expected value. We write the value of this program as

\[
\mathrm{RP} = \min_x E_\xi[\varphi(x,\xi)].
\]

For $E(\xi) = \bar{\xi}$, we obtain a third value that is the expectation of
the expected value (EEV) solution $\bar{x}(\bar{\xi})$ that is optimal in
(27.1) for $\xi = \bar{\xi}$. This quantity is

\[
\mathrm{EEV} = E_\xi\big[\varphi(\bar{x}(\bar{\xi}),\xi)\big].
\]

The effects of uncertainty are measured by differences among WS, RP, and
EEV. The expected value of perfect information represents the amount one is
willing to spend to gain information about the stochastic variables. For the
minimization problem considered here it is calculated as

\[
\mathrm{EVPI} = \mathrm{RP} - \mathrm{WS}.
\]

The value of the stochastic solution, on the other hand, measures the
additional value of solving the stochastic program over solving the
deterministic expected value problem. We define

\[
\mathrm{VSS} = \mathrm{EEV} - \mathrm{RP}.
\]
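
The ordering of the three values can be made explicit. The following short derivation is a sketch, writing $\varphi(x,\xi)$ for the objective of Program 27.1 and assuming that all expectations are finite and that $\bar{x}(\bar{\xi})$ is feasible for the recourse problem:

\[
\mathrm{WS} = E_\xi\big[\min_x \varphi(x,\xi)\big]
\;\le\; \min_x E_\xi[\varphi(x,\xi)] = \mathrm{RP}
\;\le\; E_\xi\big[\varphi(\bar{x}(\bar{\xi}),\xi)\big] = \mathrm{EEV}.
\]

The first inequality holds because $\min_x \varphi(x,\xi) \le \varphi(x,\xi)$ for every fixed $x$ and every $\xi$; the second holds because $\bar{x}(\bar{\xi})$ is one particular feasible first period decision. For a minimization problem the two differences $\mathrm{RP} - \mathrm{WS}$ and $\mathrm{EEV} - \mathrm{RP}$ are therefore both nonnegative.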
In the discussion below, we describe VSS and EVPI in the context of an
exhaustible resource model originally due to Chao.

Chao's basic model is a linear program to determine an optimal dynamic
production schedule to minimize the present value of the cost of satisfying an
increasing sequence of demand requirements over time. The demand may be
satisfied by any of $m - 1$ substitutable technologies, each using one distinct
finite resource, and by one backstop technology with no resource limit. The
resulting linear program is

\[
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{m}\sum_{t=1}^{\infty} \beta^{t} c_i\, y_{it}
  + \sum_{i=1}^{m}\sum_{t=0}^{T} \beta^{t} k_i\, x_{it} \\
\text{subject to}\quad & \sum_{t=0}^{\infty} y_{it} \le R_i, && i = 1,\dots,m; \\
& \sum_{i=1}^{m} y_{it} = D_t, && t = 1,\dots,T; \\
& y_{i,t+1} = y_{it} + \sum_{s=0}^{\infty} (\delta_s - \delta_{s-1})\, x_{i,t-s}, && t = 0,1,\dots; \\
& y_{it} \ge 0,\; x_{it} \ge 0; && t = 0,1,\dots;\; i = 1,\dots,m
\end{aligned}
\tag{27.2}
\]

where $y_{it}$ is the amount of period $t$ demand, $D_t$, satisfied by resource
$i$ at time $t$, $x_{it}$ is the amount of resource $i$ committed at $t$ to be
extracted later, $c_i$ is the current cost of technology $i$, $k_i$ is the
capital cost of $i$, $\beta$ is the discount factor, $\delta_\tau$ is the
extraction rate, and $R_i$ is the initial availability of the resource used by
technology $i$. It is assumed that $y_{it}$ and $x_{it}$ are known for
$i = 1,\dots,m$ and for $t = 0,-1,\dots$, and that
$y_{i0} = \sum_{t=-\infty}^{0} \delta_{-t}\, x_{it}$. It is also assumed that
$D_1 \le D_2 \le \dots \le D_{T-1} \le D_T$.

Chao defines $\gamma$ as the capital recovery factor for the standard time
profile, where $\gamma = 1/(\sum_{s=0}^{\infty} \beta^{s}\delta_s)$, and lets
$d_t$ be the demand for new resource commitments, where
$D_t = \sum_{s=0}^{\infty} \delta_s d_{t-s}$. The result is that (27.2) can be
rewritten as

\[
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{m}\sum_{t=0}^{T} \beta^{t} (k_i + c_i/\gamma)\, x_{it} \\
\text{subject to}\quad & \sum_{t=0}^{T} x_{it} \le R_i - \sum_{t=-1}^{-\infty}
  \Big(\sum_{s=-t}^{\infty} \delta_s\Big) x_{it}, && i = 1,\dots,m; \\
& \sum_{i=1}^{m} x_{it} = d_t, && t = 0,\dots,T; \\
& x_{it} \ge 0, && i = 1,\dots,m;\ \text{and all } t.
\end{aligned}
\tag{27.3}
\]
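
As a small numerical illustration of the capital recovery factor $\gamma = 1/\sum_s \beta^s \delta_s$ used above, the sketch below evaluates it for a geometric extraction profile. The profile $\delta_s = (1-\rho)\rho^s$, the value $\rho = 0.5$, and the function name are assumptions chosen only for illustration; they do not come from Chao's paper. For this profile the sum has the closed form $\gamma = (1-\beta\rho)/(1-\rho)$, which the code uses as a check.

```python
def capital_recovery_factor(beta, delta):
    """gamma = 1 / sum_s beta**s * delta[s] for a given extraction profile."""
    return 1.0 / sum(beta ** s * d for s, d in enumerate(delta))


beta = 0.6   # discount factor (hypothetical value)
rho = 0.5    # hypothetical decay rate of the extraction profile

# Geometric profile delta_s = (1 - rho) * rho**s, truncated after 200 terms;
# the neglected tail is far below floating point precision.
delta = [(1.0 - rho) * rho ** s for s in range(200)]

gamma = capital_recovery_factor(beta, delta)
closed_form = (1.0 - beta * rho) / (1.0 - rho)

# Combined unit cost k_i + c_i / gamma for a hypothetical technology.
k_i, c_i = 1.0, 5.0
unit_cost = k_i + c_i / gamma
print(gamma, closed_form, unit_cost)
```

The combined coefficient `k_i + c_i / gamma` is the quantity that appears in the objective of (27.3); a slower extraction profile raises the discounted sum and hence lowers $\gamma$.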


Chao uses Program 27.3 to derive his results on myopic solutions. He shows
that the corresponding transportation problem can be solved optimally by the
Northwest Corner Rule if the resource costs $k_i + c_i/\gamma$ are arranged in
increasing cost order within each period.
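
For readers unfamiliar with the Northwest Corner Rule, the following is a minimal sketch of the textbook procedure for a balanced transportation problem. It is not Chao's code; the data (three resources sorted by increasing cost, two periods) are chosen for illustration, with the backstop supply truncated so that total supply equals total demand.

```python
def northwest_corner(supply, demand):
    """Return shipments x[i][j] for a balanced transportation problem.

    Rows are sources (here: resources in increasing cost order), columns are
    sinks (here: periods). Starting in the top-left cell, ship as much as
    possible, then move down when a row is exhausted, otherwise move right.
    """
    supply, demand = list(supply), list(demand)   # work on copies
    x = [[0.0] * len(demand) for _ in supply]
    i = j = 0
    while i < len(supply) and j < len(demand):
        shipped = min(supply[i], demand[j])
        x[i][j] = shipped
        supply[i] -= shipped
        demand[j] -= shipped
        if supply[i] <= 1e-12:
            i += 1    # row exhausted: next (more expensive) resource
        else:
            j += 1    # column satisfied: next period
    return x


# Illustrative data: cheapest resource first; the last row is the backstop.
plan = northwest_corner(supply=[25, 10, 5], demand=[15, 25])
for row in plan:
    print(row)
```

With these data the rule uses the cheapest resource until it runs out, then the next one, and only then the backstop, which is exactly the myopic "cheapest available technology first" policy.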

The result leads to an expected value of perfect information of zero because 
the WS solution is the same as the RP solution. It also yields a VSS of zero 
because the EEV value is the same as RP when myopic solutions are optimal. 

Chao introduces price-responsive demands to the basic model in (27.3) and 
obtains a nonlinear programming model that does not have myopic optimal 
decisions. He computes an upper bound on the EVPI and shows that distant 
future uncertainties and low price elasticities lead to a small EVPI. In the next 
section, we introduce investment uncertainty into the basic model and show 
that this may lead to a significant EVPI and VSS. 


27.3 A Model with Uncertain Exploration Returns


We assume that $R_i$ in Program 27.3 represents the amount of resource $i$ that
is known to be available at time 0. This amount can be increased by exploration
investment, but the amount of the increase is uncertain. We also assume that
there is a capacity limit $L_i$ on the amount of a resource which may be
committed at time 0. This amount may also be increased by investment in new
capacity, and that return is assumed known with certainty. The stochastic
linear program derived from (27.3) is then


m ™m m 
minimize > (h +e;/y) aio +5 d;uyo + S> 9:%:0+ (27.4.0) 
t=1 r=1 f=1 
T m Ky ; 
Soo al (ki tei/a)ah + did, + greg} 
f=19=1 j=1 
; t-1 ; . tot ; 
subject to 2, < Ry + >» af) 42) = ye of) (27.4.1) 
s=—0 s=0 


ee sbi +>. vi), (27.4.2) 


20, ¢=1,...,mjt=0,...,037 = 1,...,Ke3 (27.4.4) 


where $d_i$ is the cost of one unit of exploration for resource $i$,
$u_{it}^{j}$ is the amount of exploration, $g_i$ is the cost of capital
investment in resource $i$, $w_{it}^{j}$ is the amount of that investment,
$\pi_t^{j}$ is the probability of scenario $j$ at time $t$, $K_t$ is the number
of scenarios at time $t$, and $\alpha_{it}^{j}$ is the return per unit of
exploration for resource $i$ under scenario $j$. Each scenario $j$ is preceded
by ancestor scenarios in previous periods, which are designated by $a(j)$.

The stochastic nature of Program 27.4 is contained only in the return on
exploration investment, $\alpha_{it}^{j}$. In general, these values may vary
continuously, but the discrete formulation in (27.4) is used for simplicity.
This program involves a stochastic technology matrix, but it may be formulated
with stochastic right-
hand sides by defining new variables wf,, £> 0, such that 


uf) = \ toliwf, (27.5) 


and 
< Av), Sock Pee (27.6) 


where head is the igre of resource 2 in period ¢—1, there are fi different 


values of a; 4-1, and wi, <0 for all 2 except for = & such that a af = Carder 


The upper bound on these new variables is sufficiently large to allow any investment level. 
The stochastic right-hand side problem is then formed by substituting (27.5), 
(27.6), and a constraint where R?, is set equal to the right-hand side of (27.6), 
for Constraint 27.4.1 in Program 27.4. 

In the deterministic version of (27.4), the investment decisions may skip
from investment in one resource to another according to the values of
$\alpha_{it}^{j}$. This is due to the basic property of the linear program in
which extreme point values correspond to investments in single resources. The
solution of (27.4) allows for many more combinations of alternative investment
decisions and, hence, provides for hedging against other possibilities. This
hedging characteristic yields a positive VSS in many cases, and the value of
knowing the investment return yields a positive EVPI. An example of these
occurrences appears in the next section.


27.4 Example

We consider a two period problem to demonstrate the potential effect of in- 
vestment uncertainty. In this example, we consider three technologies. The 
first technology uses a resource in which investment return is highly variable. 
The second technology corresponds to a resource in which investment in addi- 
tional capacity results in certain returns. The third technology is an infinitely 
available backstop. The data for the model are in Table 27.1. 


Table 27.1 Model Input Data


Resources               Current Cost    Initial Availability
  Res 1                      5.0              25.0
  Res 2                     10.0              10.0
  Backstop                  16.7              +∞

Investment              Cost            Return
  Res 1 - Good Luck          1.0               1.0
  Res 1 - Bad Luck           1.0               0.1
  Res 2                      1.0               1.0

Periods                 Demand
  First                     15.0
  Second                    25.0

Scenarios               Probability
  Good Luck                  0.5
  Bad Luck                   0.5

Discount Factor         β = 0.6

The only uncertainty in this model is in the return for Resource 1
exploration investment. Resource 2 investment can be interpreted as building
additional capacity. This model can be formulated as a stochastic linear
program with recourse and with uncertainty in the right-hand side by using
constraints as in (27.5) and (27.6). In this case, we obtain the following
two-stage stochastic linear program in which $x$ represents first period
decisions and $y$ represents second period decisions.

\[
\begin{aligned}
\text{minimize}\quad & z = 5x_1 + 10x_2 + 16.7x_3 + x_4 + x_5 + E_\xi[\,3y_5 + 6y_6 + 10y_7\,] \\
\text{subject to}\quad & x_1 \le 25 \\
& x_2 \le 10 \\
& x_1 + x_2 + x_3 \ge 15 \\
& -x_1 + y_1 + 0.1\,y_3 + y_4 = 0 \\
& x_4 - y_3 - y_4 = 0 \\
& -x_2 + x_5 + y_2 = 0 \\
& y_4 \le \xi \\
& y_1 + y_5 \le 25 \\
& y_2 + y_6 \le 10 \\
& y_5 + y_6 + y_7 \ge 25, \\
& x_1,\dots,x_5 \ge 0,\quad y_3,\dots,y_7 \ge 0,
\end{aligned}
\tag{27.7}
\]

where $P\{\xi = 0\} = 0.5$ and $P\{\xi = 10\} = 0.5$. In this program, $x_1$,
$x_2$, and $x_3$ represent commitments of the resources, $x_4$ and $x_5$ are
investment variables, $y_1$ and $y_2$ represent the net changes in resource
availabilities, $y_3$ and $y_4$ represent the amount of new Resource 1
availability obtained through investment, and $y_5$, $y_6$, and $y_7$ represent
commitments in the second period.

The alternatives to Program 27.7 are to solve deterministic models that
assume good luck, bad luck, a mean value with $\xi = \bar{\xi} = 5$, or a
single myopic solution. For each of these solutions, we obtain the expectation
of the two period costs after using the first period solution obtained by these
deterministic problems (as in finding the EEV). These values are


Scenario       Deterministic Value | Expectation Value
Good Luck            175.0
Bad Luck             200.0
Mean                 185.0
Myopic               215.0


These values can be compared to the value of the stochastic program (27.7),
which is 192.5.

We can then obtain the information values, EVPI and VSS. The expected
value of perfect information is

\[
\mathrm{EVPI} = \mathrm{RP} - \mathrm{WS} = 192.5 - 187.5 = 5.0.
\]

The value of the stochastic solution is

\[
\mathrm{VSS} = \mathrm{EEV} - \mathrm{RP} = 200.75 - 192.5 = 8.25.
\]

The value of the stochastic solution relative to the myopic, or no investment,
solution is also of interest. It is 215.0 - 192.5 = 22.5.

The difference between the EVPI and VSS values demonstrates how these
quantities reflect different values of uncertainty. The EVPI is lower than the
VSS because the RP solution can fairly adequately hedge against either of the
future outcomes. In the RP solution, there is investment in both Resource 1
and Resource 2 capacity ($x_4 = 10$ and $x_5 = 4$) so that no backstop usage is
necessary in either scenario. The mean value solution, however, only involves
investment in Resource 1 so that the backstop must be used in the bad luck
scenario. This leads to a higher VSS than EVPI and shows the merit of using
the stochastic program solution.
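
The figures quoted in this example can be checked with a few lines of code. The sketch below is not the author's computation: it assumes that first period demand is met entirely from Resource 1 ($x_1 = 15$, $x_2 = x_3 = 0$), evaluates the second period by a cheapest-first dispatch using the discounted costs 3, 6 and 10 of Program 27.7, and searches the two investment variables on a grid with step 0.5 instead of solving the linear program. All function and variable names are hypothetical.

```python
SCENARIOS = [(10.0, 0.5), (0.0, 0.5)]   # (xi, probability): good luck, bad luck
FIRST_PERIOD_COST = 5.0 * 15.0          # assumption: x1 = 15, x2 = x3 = 0


def second_stage_cost(x4, x5, xi):
    """Discounted second period cost for exploration x4 and capacity x5."""
    # Exploration up to xi returns 1.0 per unit, the remainder only 0.1.
    new_res1 = min(x4, xi) + 0.1 * max(x4 - xi, 0.0)
    merit_order = [
        (3.0, 25.0 - 15.0 + new_res1),   # Resource 1 left after period 1
        (6.0, 10.0 + x5),                # Resource 2 plus added capacity
        (10.0, float("inf")),            # backstop
    ]
    remaining, cost = 25.0, 0.0          # second period demand
    for unit_cost, available in merit_order:
        used = min(remaining, available)
        cost += unit_cost * used
        remaining -= used
    return cost


def total_cost(x4, x5, xi):
    return FIRST_PERIOD_COST + x4 + x5 + second_stage_cost(x4, x5, xi)


def expected_cost(x4, x5):
    return sum(p * total_cost(x4, x5, xi) for xi, p in SCENARIOS)


GRID = [0.5 * k for k in range(41)]      # candidate investments 0, 0.5, ..., 20


def best(objective):
    """Smallest objective value on the grid, with its (x4, x5) minimizer."""
    return min((objective(a, b), a, b) for a in GRID for b in GRID)


ws = sum(p * best(lambda a, b, xi=xi: total_cost(a, b, xi))[0]
         for xi, p in SCENARIOS)
rp, x4_rp, x5_rp = best(expected_cost)
_, x4_ev, x5_ev = best(lambda a, b: total_cost(a, b, 5.0))   # mean value problem
eev = expected_cost(x4_ev, x5_ev)

print("WS", ws, "RP", rp, "EEV", eev)
print("EVPI", rp - ws, "VSS", eev - rp, "RP solution", (x4_rp, x5_rp))
```

Under these assumptions the grid search returns the hedged investment pair $x_4 = 10$, $x_5 = 4$ for the recourse problem, while the mean value problem invests only in Resource 1, which is the behaviour described in the text.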

Investment in two resources is unique to the stochastic program solution. 
Any deterministic scenario only involves investment in one resource. This again 
shows the utility of the stochastic program. It is able to blend the determin- 
istic solutions so that the decision maker does not have to decide between two 
completely different solutions. 

We also note that the addition of investment has a significant effect on
the value relative to the myopic solution. If no investment is allowed then the
myopic solution would be optimal, and the backstop would necessarily be used
to satisfy five units of demand in the second period. An exhaustible resource
model with investment therefore clearly must consider future scenarios, and the
solution of an equivalent stochastic program can have significant advantages
over the solution of a deterministic expected value problem.


CHAPTER 28


A TWO-STAGE STOCHASTIC FACILITY-LOCATION 
PROBLEM WITH TIME-DEPENDENT SUPPLY 


S.W. Wallace 


Abstract 


A stochastic facility-location problem with recourse is solved by the L-shaped 
decomposition method. The purpose is to find which plants, from a set of 
potential plants, should be opened. The supply is random and varying over 
time. 

To each potential plant is attached a fixed cost. The decomposition results
in a stochastic transportation problem and an NP-hard problem with a
quasi-concave objective function and linear constraints.


28.1 Introduction 


We are concerned with the following problem: A set of supply points is given,
each point having a supply that varies over the year. The supply points in
general have their supply peaks at different times. The supply is random.

A set of potential demand points is also given. We want to establish which 
of them should be kept/built and which should be closed/not built. For the 
existing ones we also consider the possibility of increasing their capacities. To 
each potential demand point is attached a fixed cost depending on the capacity 
of the demand point, which also is to be determined. 

Due to the variation of supply over time, we will divide the year into T 
time periods. Clearly we cannot expect the capacity at the demand points to 
be fully utilized in all time periods. Still the fixed cost will be the same in all 
periods, namely the one given by the amount received in the most intensive 
period. 

The problem is motivated by a problem from the Norwegian fish meal and 
fish oil industry. The supply points represent fishing grounds for which the 
quotas are stochastic and variable throughout the year. The demand points are 
potential plants (see Section 28.7 for further details). References [15] and [16] 
also give background information. 

Transportation costs are given between all pairs of supply/potential demand
points. If handling costs differ among the plants, they must be included in the
transportation costs, and not in the fixed costs of the plants. The formulation
is as follows:

\[
\min\; E_\omega\Big\{\min \sum_{i}\sum_{j}\sum_{t} c_{ij}\, y_{ij}^{t}\Big\}
  + \sum_{j=1}^{M2-1} h_j(z_j) \tag{28.1}
\]

subject to

\[
\sum_{j=1}^{M2} y_{ij}^{t} = S_i^{t}(\omega), \qquad t = 1,\dots,T,\; i = 2,\dots,M1 \tag{28.2}
\]
\[
-\sum_{i=2}^{M1} y_{ij}^{t} \ge -z_j, \qquad t = 1,\dots,T,\; j = 1,\dots,M2-1 \tag{28.3}
\]
\[
0 \le z_j \le d_j, \qquad j = 1,\dots,M2-1 \tag{28.4}
\]
\[
y_{ij}^{t} \ge 0 \tag{28.5}
\]

where

$y_{ij}^{t}$ equals the number of loads from supply point $i$ to demand point
$j$ in time period $t$. (We relax the natural integrality requirement.)

$c_{ij}$ equals the cost per load sent from supply point $i$ to demand point
$j$.

$z_j$ equals the capacity in loads per time period for demand point $j$.

$h_j(z_j)$ equals the fixed cost attached to demand point $j$ as a function of
$z_j$.

$S_i^{t}(\omega)$ equals the uncertain amount supplied at supply point $i$ in
time period $t$.


Note that demand point $M2$ has infinite capacity, i.e. it represents a
recourse action such as sending to a second rate market or dumping. Therefore
we will assume that $c_{i,M2} > c_{ij}$ for $j = 1,\dots,M2-1$. Clearly the
problem is always feasible.

The requirements (28.4) might be dropped, depending on the situation in 
which the method is used. 

The function $h_j(z_j)$ is assumed to be quasi-concave; in practice we will
assume the following form

\[
h_j(z_j) =
\begin{cases}
0 & \text{if } z_j = 0, \\
H_j + h_j z_j & \text{if } z_j > 0,\; H_j > 0.
\end{cases}
\]
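
The fixed-charge form of the cost and the rule that capacity is set by the most intensive period can be illustrated with a short sketch. The function names and the numbers are hypothetical; the code only assumes the cost is zero for a closed plant and $H_j + h_j z_j$ for an open plant of capacity $z_j > 0$.

```python
def fixed_cost(z, H, h):
    """Fixed-charge cost: 0 for a closed plant, H + h*z for capacity z > 0."""
    if z < 0:
        raise ValueError("capacity must be nonnegative")
    return 0.0 if z == 0 else H + h * z


def required_capacity(loads_by_period):
    """Capacity one plant needs: its largest total inflow over the periods.

    loads_by_period[t] lists the loads received from each supply point in
    period t.
    """
    return max(sum(period) for period in loads_by_period)


# Hypothetical plant receiving loads from two supply points over three periods.
loads = [[3, 2], [6, 1], [0, 4]]
z = required_capacity(loads)          # busiest period: 6 + 1 = 7 loads
cost = fixed_cost(z, H=10.0, h=2.0)   # 10 + 2 * 7
print(z, cost)
```

The jump of size $H_j$ at $z_j = 0$ is what makes the objective discontinuous, and the `max` over periods is why the variable part of the fixed cost cannot be folded into per-load transportation costs.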


Although we are concerned with several time periods, our problem does
not belong to "dynamic facility location" problems or "multiperiod capacity
expansion" problems. An important reason for this is that although both
problems (usually) operate with $T$ time periods ($T$ finite), our problem does
not end there, but rather starts in period 1 again. In dynamic location
problems (see e.g. [5], [9]), however, time $T$ marks the end of the time
horizon. Therefore, the time of investment is important due to present value
considerations. In our case, investments are made at the start of period one
and capacity is kept at that level throughout time.

Our problem will therefore, to a certain extent, belong to the one-period
facility location problem, although that period is divided into $T$ subperiods.
The complications here are naturally due to the stochastic supply, but also:


1. The sizes of the plants are variables, thereby spoiling the network structure 
of the constraints. 


2. The discontinuous (quasi-concave) objective function. 


3. The variable part of the fixed cost of a plant cannot be included in the 
transportation costs because we have more than one time period. 


Problem 1 will be attacked through decomposition, problem 2 through 
enumeration of extreme points or a series of linear programs. 

Problem 3 complicates the decomposition since the master problem of the
decomposition must determine not only which plants to open, but also their
sizes.

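We will use the L-shaped decomposition of the problem, outlined in [13]
and [18]. This amounts to writing the problem in the following form:

\[
\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{M2-1} h_j(z_j) + \theta \\
\text{subject to}\quad & Q(z) \le \theta \\
& 0 \le z_j \le d_j, \qquad j = 1,\dots,M2-1
\end{aligned}
\]

where $Q(z)$ is defined as $Q(z) = E_\omega Q(z,\omega)$ and $Q(z,\omega)$ is
given by

\[
Q(z,\omega) = \inf\Big\{ \sum_{i}\sum_{j}\sum_{t} c_{ij}\, y_{ij}^{t} \;\Big|\;
  N' y \begin{pmatrix} = \\ \ge \end{pmatrix} b',\; y \ge 0 \Big\} \tag{28.6}
\]

where $N'$ is the coefficient matrix for the $y$'s in (28.2) and (28.3) and
$b' = \binom{S(\omega)}{-z}$ is appropriately sorted to fit $N'$.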


The L-shaped algorithm is a "tight" cutting plane algorithm that in general
allows for both feasibility and optimality cuts. Since our subproblem (i.e. to
find $Q(z)$ for given $z$) is always feasible, we will only need optimality
cuts.

A method very similar to the L-shaped is an outer approximation using 
nonstochastic tenders, see [12]. We will see later that due to some separability 
properties in our problem, these two methods are equivalent. 

The L-shaped algorithm can be viewed as a version of Benders’ decompo- 
sition [1], as applied to L-shaped structured problems. 

Note that the solution to our problem is a set of variables $z_j$,
$j = 1,\dots,M2-1$, and not a set of variables $z_j$ and $y_{ij}^{t}$. The
problem is a so-called two-stage stochastic optimization problem or stochastic
program with recourse. This means that first the decision-maker must determine
the $z_j$'s on the basis of only the distribution of the supply. Then, after
the realization of the supply, the short run (second stage) recourse variables
$y_{ij}^{t}$ are determined. When solving (28.6), we therefore (in general)
will get different $y$'s for the different realizations of $\omega$, while e.g.
[6] gets a solution consisting of both $y_{ij}$ and $z_j$. So even though these
problems (i.e. ours and [6]) may look similar, their nature is significantly
different.


28.2 Determination of Q(z), the Subproblem 


Q(z) = E.Q(z,w) where 


Q(2,w) = inf OD Deavhla’(S) b',y > 0} and (28.6) 


b= ) appropriately sorted to fit N’. 


If we write this in more detail we get Q(z,w) = 


minimize > > y CifYf (28.7) 
i jy ¢ 


M2 

subject to So uty = Siw) t=1,...,7,7=2,...,M1 (28.8) 
j=l 
M1 
-Soyf, 2-2, t=1,...,T, 7 =1,...,.M2—-1 (28.9) 
r= 2 
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For 2 fixed (28.7), (28.8) and (28.9) is separable in T subproblems, so 


$$Q(x,\omega) = \sum_t Q^t(x,\omega)$$

where

$$Q^t(x,\omega) = \inf\Big\{ \sum_i \sum_j c_{ij}\, y_{ij} \;\Big|\; Ny \,(=,\ge)\, b^t,\ y \ge 0 \Big\} \tag{28.10}$$


where $N$ and $b^t$ are the coefficient matrix and right hand side of the system

$$\sum_{j=1}^{M2} y_{ij} = S_i^t(\omega), \quad i = 2,\ldots,M1 \tag{28.11}$$

$$-\sum_{i=2}^{M1} y_{ij} \ge -x_j, \quad j = 1,\ldots,M2-1 \tag{28.12}$$


The constraints of the subproblem, namely (28.11) and (28.12), are not
written in standard transportation format. We therefore introduce a dummy
supply point, supply point 1, and let $c_{1j} = 0$ for all $j$.

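Furthermore we change the inequality signs in (28.12) to equalities and let

$$S_1^t(\omega) = \max\Big\{0,\ \sum_{j=1}^{M2-1} x_j - \sum_{i=2}^{M1} S_i^t(\omega)\Big\}$$

$$x_{M2}^t(\omega) = \max\Big\{0,\ \sum_{i=2}^{M1} S_i^t(\omega) - \sum_{j=1}^{M2-1} x_j\Big\}$$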


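Thereby we get (leaving out the indices $\omega$ and $t$)

$$\text{minimize} \quad \sum_{i=1}^{M1} \sum_{j=1}^{M2} c_{ij}\, y_{ij}$$

$$\text{subject to} \quad \sum_{j=1}^{M2} y_{ij} = S_i, \quad i = 2,\ldots,M1 \tag{28.11'}$$

$$-\sum_{i=1}^{M1} y_{ij} = -x_j, \quad j = 1,\ldots,M2 \tag{28.12'}$$

$$y_{ij} \ge 0$$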


Since the constraints of a transportation network are linearly dependent, we 
have omitted the equation for supply point 1. 
The dual of this is 


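$$\text{maximize} \quad \sum_{i=2}^{M1} S_i\, \pi(s_i) - \sum_{j=1}^{M2} x_j\, \pi(x_j)$$

$$\text{subject to} \quad \pi(s_i) - \pi(x_j) \le c_{ij}, \quad i = 2,\ldots,M1,\ j = 1,\ldots,M2$$

$$-\pi(x_j) \le c_{1j} = 0, \quad j = 1,\ldots,M2$$

$$\pi(s_i),\ \pi(x_j) \ \text{unrestricted in sign.}$$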
But these constraints can be rewritten as

$$\pi(s_i) - \pi(x_j) \le c_{ij}, \quad i = 2,\ldots,M1,\ j = 1,\ldots,M2$$

$$\pi(x_j) \ge 0, \quad j = 1,\ldots,M2 \tag{28.13}$$

$$\pi(s_i) \ \text{unrestricted in sign.}$$
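If we take the dual once more we get:

$$\min \quad \sum_{i=2}^{M1} \sum_{j=1}^{M2} c_{ij}\, y_{ij}$$

$$\sum_{j=1}^{M2} y_{ij} = S_i, \quad i = 2,\ldots,M1 \tag{28.11''}$$

$$-\sum_{i=2}^{M1} y_{ij} \ge -x_j, \quad j = 1,\ldots,M2 \tag{28.12''}$$

$$y_{ij} \ge 0$$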


Except for the constraint for $j = M2$ this is equal to (28.11) and (28.12),
but since $c_{i,M2} > c_{ij}$ for $j = 1,\ldots,M2-1$, we know that leaving out
the inequality for $j = M2$ in (28.12'') will not alter the solution. Therefore
solving problem (28.11') and (28.12') instead of (28.11) and (28.12) gives the
correct dual variables.

Alternatively we can say that since relaxing constraint $M2$ in (28.12'') to
$(x_{M2} + \varepsilon)$ will not make any difference (no flow will be moved
from any of the other demand nodes), $\pi(x_{M2}) = 0$. By putting
$\pi(x_{M2}) = 0$ into (28.13) and then taking the dual, we will get (28.11)
and (28.12). The conclusion is therefore:


Remark: By introducing a dummy supply node into (28.11) and (28.12), mak- 
ing sure that supply equals demand and letting all inequalities be equalities, we 
get a transportation problem for which the dual variables coincide with those 
of (28.11) and (28.12). 

From now on, we will use formulation (28.11') and (28.12'). We will call
the coefficient matrix $N$ and the right hand side $b$ although the number of
rows has increased by one.
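The balancing step in the remark can be illustrated with a small sketch. The function name `balance` and the sample numbers are hypothetical; the code only computes the dummy supply $S_1$ and the demand $x_{M2}$ of the last demand point so that total supply equals total demand, as the remark requires.

```python
def balance(supplies, capacities):
    """Return (dummy_supply, overflow_demand) that balance the problem.

    supplies   -- S_i for the real supply points i = 2, ..., M1
    capacities -- x_j for the plants j = 1, ..., M2 - 1
    """
    total_supply = sum(supplies)
    total_capacity = sum(capacities)
    dummy_supply = max(0, total_capacity - total_supply)     # S_1
    overflow_demand = max(0, total_supply - total_capacity)  # x_{M2}
    return dummy_supply, overflow_demand


# Supply exceeds plant capacity: the surplus goes to demand point M2.
s1, x_m2 = balance([40, 30], [25, 25])
assert (s1, x_m2) == (0, 20)
assert s1 + 40 + 30 == 25 + 25 + x_m2   # supply equals demand
```

At most one of the two quantities is positive, so the dummy node never ships to the overflow demand point in a balanced instance of this kind.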

Assuming that $\omega$ has a finite number $K$ of possible outcomes, with $p_k$
the probability of outcome $k$, $Q(x)$ becomes

$$Q(x) = \sum_{k=1}^{K} p_k \sum_{t=1}^{T} Q^t(x,\omega_k)$$

where $Q^t(x,\omega_k)$ now is

$$\inf\Big\{ \sum_i \sum_j c_{ij}\, y_{ij} \;\Big|\; Ny = b^t,\ y \ge 0 \Big\} \tag{28.10'}$$

The dual of (28.10') is given by

$$P^t(x,\omega_k) = \sup\{\pi b^t \mid \pi N \le c\} \tag{28.14}$$


Let $y_0$ be the optimal solution to (28.10') and $\pi_0$ the optimal solution
to (28.14). Then $\pi_0 b^t = c y_0$, i.e. $P^t(x,\omega) = Q^t(x,\omega)$.
So therefore

$$Q(x) = \sum_{k=1}^{K} p_k \sum_{t=1}^{T} P^t(x,\omega_k) \tag{28.15}$$


If $N_0$ is the optimal basis, we know that $\pi_0 = c_0 N_0^{-1}$, i.e. it is
not a function of the right hand side $b^t$. Therefore if two right hand sides
both have the same optimal basis, they have equal dual variables, which can be
utilized in (28.15) by bunching all possible right hand sides of (28.15) which
have the same optimal basis. (There is a total of $T \cdot K$ right hand
sides.) How this can efficiently be done is explained in [17]. Let
$\pi_k^t(s_i)$ be the optimal value of the variable in (28.14) that corresponds
to supply point $i$ and $\pi_k^t(x_j)$ the same for the demand point $j$.
Following formula (2.20) of [18], we get the following optimality cut if a
given $x$ is not optimal.

$$Vx + \theta \ge v \tag{28.16}$$

where

$$V = \sum_{k=1}^{K} p_k \sum_{t=1}^{T} \pi_k^t(x) \tag{28.17}$$

and

$$v = \sum_{k=1}^{K} p_k \sum_{t=1}^{T} \pi_k^t(s)\, S^t(\omega_k)$$

If $p_k = 1/K$, which often will be quite reasonable, then

$$V' = \sum_{k=1}^{K} \sum_{t=1}^{T} \pi_k^t(x) \tag{28.16'}$$

$$v' = \sum_{k=1}^{K} \sum_{t=1}^{T} \pi_k^t(s)\, S^t(\omega_k) \tag{28.17'}$$

in which case (28.16) can be written as

$$V'x + K\theta \ge v'$$

with all coefficients integer provided $c_{ij}$ is integer.

We have already shown that $\pi(x_j) \ge 0$ for all $j$. It is also easy to
demonstrate that $\pi(s_i)$ is greater than or equal to zero. Note that since
$c_{ij} \ge 0$ for all $i$ and $j$ and since we are minimizing, (28.11) and
(28.12) can be rewritten as

$$\sum_{j=1}^{M2} y_{ij} \ge S_i^t(\omega), \quad i = 2,\ldots,M1$$

$$-\sum_{i=2}^{M1} y_{ij} \ge -x_j, \quad j = 1,\ldots,M2-1$$

This is clearly true since we always will try to send as little as possible,
forcing equality in the supply constraints.
The dual of this is (using the objective function (28.7)):

$$\text{maximize} \quad \sum_{i=2}^{M1} S_i^t(\omega)\, \pi^t(s_i) - \sum_{j=1}^{M2} x_j\, \pi^t(x_j)$$

$$\text{subject to} \quad \pi^t(s_i) - \pi^t(x_j) \le c_{ij}$$

$$\pi^t(s_i) \ge 0$$

$$\pi^t(x_j) \ge 0$$

$$\pi^t(x_{M2}) = 0$$

From this it follows that all elements in $V$ are nonnegative and $v$ is
positive in (28.16). Furthermore $Q(x) = v - Vx$ at the point $x$ for which the
cut was generated.
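As a minimal sketch of how the cut coefficients in (28.16) might be assembled from the dual solutions, the following function sums the duals over outcomes and time periods. The function name, the nested-list layout of the inputs and the numbers in the example are assumptions made for illustration; the dual values in the example were worked out by hand for a one-plant instance with unit cost 2 to the plant and 10 to the overflow demand point.

```python
def optimality_cut(probs, demand_duals, supply_duals, supplies):
    """Build (V, v) for the cut V x + theta >= v.

    probs[k]              -- probability p_k of outcome k
    demand_duals[k][t][j] -- pi_k^t(x_j), dual of the capacity row of plant j
    supply_duals[k][t][i] -- pi_k^t(s_i), dual of supply row i
    supplies[k][t][i]     -- S_i^t(omega_k)
    """
    n_plants = len(demand_duals[0][0])
    V = [0.0] * n_plants
    v = 0.0
    for p, dem_k, sup_k, s_k in zip(probs, demand_duals, supply_duals, supplies):
        for dem_t, sup_t, s_t in zip(dem_k, sup_k, s_k):
            for j, pi_x in enumerate(dem_t):
                V[j] += p * pi_x
            v += p * sum(pi_s * s for pi_s, s in zip(sup_t, s_t))
    return V, v


# One plant with capacity x = 5, one period, one supply point.
# Outcome 1: supply 3 (spare capacity), duals pi(s) = 2, pi(x) = 0.
# Outcome 2: supply 8 (plant full),     duals pi(s) = 10, pi(x) = 8.
V, v = optimality_cut(
    probs=[0.5, 0.5],
    demand_duals=[[[0.0]], [[8.0]]],
    supply_duals=[[[2.0]], [[10.0]]],
    supplies=[[[3.0]], [[8.0]]],
)
x = 5.0
q = v - V[0] * x   # Q(x) = v - Vx at the point that generated the cut
```

In this instance the expected recourse cost computed directly is 0.5 * (3 * 2) + 0.5 * (5 * 2 + 3 * 10) = 23, which agrees with `q`.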

A disadvantage of this method is that even if the number of used demand
points is low, we solve a transportation problem of full size. Other approaches,
such as [6], avoid this, but if we were to follow these methods we would lose
other advantages, such as the efficiency of the dual decomposition outlined in
[17].

We then turn to: 


28.3 The Equivalent Deterministic Program


If $\omega$ has a finite number of outcomes, $Q(x)$, the subproblem, will be
polyhedral in $x$, [18]. Thereby (28.1)-(28.5) can be written equivalently as

$$\text{minimize} \quad \sum_j \tilde h_j(x_j) + \theta \tag{28.19}$$

$$\text{subject to} \quad V_i x + \theta \ge v_i, \quad i = 1,\ldots,R \tag{28.20}$$

$$0 \le x \le d$$

We call (28.19)-(28.20) the equivalent deterministic program. It has the same
solution as (28.1)-(28.5). The constraints $V_i x + \theta \ge v_i$ are of the
form (28.16) generated by the subproblem.

We will next define the relaxed deterministic problem as

$$\text{minimize} \quad \sum_j \tilde h_j(x_j) + \theta \tag{28.19}$$

$$\text{subject to} \quad V_i x + \theta \ge v_i, \quad i = 1,\ldots,r \tag{28.21}$$

$$0 \le x \le d$$


The relaxed deterministic program is a part of an iteration in order to
approximate the equivalent deterministic program. The iteration can be stated
as follows: Pick an arbitrary (reasonable) $x^0$. Solve the subproblem, i.e.
find $Q(x^0)$. Determine an optimality cut of the type (28.16), and solve
(28.19), (28.21) setting $r = 1$ and $l = 1$.

Let $x^l$, $\theta^l$ be the optimal solution to (28.19), (28.21). Then find
$Q(x^l)$. If $Q(x^l) \le \theta^l$ stop, otherwise construct a new optimality
cut of type (28.16), increase $r$ and $l$ by one and re-solve.

The program can also be started by a number of (intelligent) guesses $x^i$,
$i = 1,\ldots,r$, if such are available. This fits the idea of nonstochastic
tenders in [13].
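The iteration can be sketched on a toy instance with a single plant. Everything below is an illustrative assumption, not the implementation used in the chapter: the subproblem is solved in closed form (each load costs `c` at the plant and `c_out > c` otherwise), and the relaxed problem is solved by checking the bounds and the intersection points of the cuts, which is sufficient in one dimension.

```python
def subproblem(x, scenarios, c, c_out):
    """Return Q(x) and the cut (V, v) for a one-plant toy problem.

    scenarios is a list of (probability, supply) pairs.
    """
    V = v = 0.0
    for p, s in scenarios:
        if s > x:    # plant is full: duals pi(s) = c_out, pi(x) = c_out - c
            V += p * (c_out - c)
            v += p * c_out * s
        else:        # spare capacity: duals pi(s) = c, pi(x) = 0
            v += p * c * s
    return v - V * x, (V, v)


def master(cuts, H, h, d):
    """Minimize fixed-charge cost + theta subject to the cuts, 0 <= x <= d."""
    def theta(x):
        return max(v - V * x for V, v in cuts)

    def cost(x):
        return (H + h * x if x > 0 else 0.0) + theta(x)

    # For x > 0 the objective is convex piecewise linear, so a minimizer lies
    # at a bound or at a point where two cuts intersect.
    candidates = {0.0, float(d)}
    for a, (V1, v1) in enumerate(cuts):
        for V2, v2 in cuts[a + 1:]:
            if V1 != V2:
                x = (v1 - v2) / (V1 - V2)
                if 0.0 <= x <= d:
                    candidates.add(x)
    best = min(candidates, key=cost)
    return best, theta(best)


def l_shaped(scenarios, c, c_out, H, h, d, x0, tol=1e-9, max_iter=50):
    """Alternate between subproblem and relaxed problem until Q(x) <= theta."""
    cuts, x, theta = [], x0, float("-inf")
    for _ in range(max_iter):
        q, cut = subproblem(x, scenarios, c, c_out)
        if q <= theta + tol:          # stopping rule Q(x^l) <= theta^l
            return x, q
        cuts.append(cut)              # new optimality cut of type (28.16)
        x, theta = master(cuts, H, h, d)
    raise RuntimeError("no convergence within max_iter iterations")


x_opt, q_opt = l_shaped([(0.5, 4), (0.5, 10)], c=1, c_out=5,
                        H=3, h=1, d=12, x0=0.0)
total_cost = 3 + 1 * x_opt + q_opt
```

For these numbers the loop visits x = 0, 12, 7 and then stops at x = 10, where the approximation theta equals Q(x).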

Since (28.19) is assumed to be quasi-concave, we know that the solution to 
the relaxed deterministic problem can be found in one of the extreme points of 
(28.21). We will therefore use an idea presented in [10], although the algorithm 
as such can be found in [4] and [14]. This is an exact method. In the next 
section we will present a heuristic approach. 

The algorithm is based on the following propositions: 
Proposition 1. The k-th best extreme point in the set of feasible solutions to
an LP will always be the neighbor of one of the (k - 1) best.


Proposition 2. Given the graph G with one node for each extreme point and 
one arc for each pair of neighbor extreme points. Then there is a path from 
any node in the graph to the root node (representing the optimal solution) on 
which the objective function is nonincreasing. 


If the purpose (as in [14]) is to find the $k$ best extreme points in the
set of feasible solutions, the algorithm goes as follows, based on Proposition
1. (We assume here that degeneracy does not occur; this is only to make the
presentation simpler.)

(i) Find the optimal solution. Let $t = 1$.
(ii) Find the $(t+1)$-st best extreme point as the neighbor of one of the $t$ best.
(iii) Increase $t$ by one. If $t = k$, stop. Otherwise go to (ii).

In order to solve step (ii), one will usually store some information about
all the neighbors of the $t$ best extreme points. Otherwise the amount of work
will be too large. Therefore in step (iii), before returning to step (ii), one
must calculate the appropriate information about those neighbors of extreme
point $t$ that have not already been found.

If the algorithm proceeds as explained above, the following easily follows
from Propositions 1 and 2.


Proposition 3. The value of the objective function for all nodes created in
step (iii) will be at least as high as for node t.
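Steps (i)-(iii) amount to a best-first search on the graph of extreme points. The sketch below assumes that the caller supplies a `neighbors` function and a `value` function (both hypothetical names) and that degeneracy does not occur; it uses the unit cube with a linear objective as a stand-in for the feasible region of an LP.

```python
import heapq


def k_best_vertices(start, neighbors, value, k):
    """Rank the k best nodes of a graph, starting from the optimal node."""
    ranked = []
    seen = {start}
    frontier = [(value(start), start)]        # neighbors of the ranked nodes
    while frontier and len(ranked) < k:
        val, node = heapq.heappop(frontier)   # step (ii): best unranked neighbor
        ranked.append((val, node))
        for nb in neighbors(node):            # step (iii): store new neighbors
            if nb not in seen:
                seen.add(nb)
                heapq.heappush(frontier, (value(nb), nb))
    return ranked


# Toy polytope: the unit cube; neighboring vertices differ in one coordinate.
cost = (1, 2, 4)


def cube_neighbors(vtx):
    return [vtx[:i] + (1 - vtx[i],) + vtx[i + 1:] for i in range(len(vtx))]


def cube_value(vtx):
    return sum(c * x for c, x in zip(cost, vtx))


best4 = k_best_vertices((0, 0, 0), cube_neighbors, cube_value, 4)
```

The heap plays the role of the stored information about neighbors: every node popped has a value at least as high as the previously ranked node, in line with Proposition 3.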


We now show how by relying on the above proposition and the preced- 
ing algorithm, we can solve our problem which has a quasi-concave objective 
function. 

Assume as before that

$$\tilde h_j(x_j) = \begin{cases} 0 & \text{if } x_j = 0 \\ H_j + h_j x_j & \text{if } x_j > 0. \end{cases}$$

Then as first step of the algorithm solve:

$$\text{minimize} \quad \sum_j h_j x_j$$

$$\text{subject to} \quad V_i x + \theta \ge v_i, \quad i = 1,\ldots,r \tag{28.22}$$

$$0 \le x \le d$$

An optimal solution to this is easy to find, and it is called extreme point 1.
The variables $z_i$ and $w_i$, defined below, are associated with extreme
point $i$:

$$z_i = \sum_j h_j x_{ij}$$

and

$$w_i = \sum_{j \in B_i} H_j$$

where $B_i$ is the set of all positive basic variables for extreme point $i$.

A variable WMIN is given as a lower bound on $w$. If no a priori information
exists, WMIN = 0 can always be used.

Let FOPT $= z_1 + w_1$, OPTEX $= 1$. Find all the neighbors of node
one for which $z_i <$ FOPT $-$ WMIN. Put them into a list $D$ sorted by
increasing values of $z_i$.

In a systematic way we now examine the rest of the extreme points of
(28.22). Assume at some step that we have found the $k$ best extreme points
(i.e. according to the objective function of (28.22)). Assume

$$z_l + w_l = \min_{1 \le i \le k} \{z_i + w_i\}$$

such that FOPT $= z_l + w_l$ and OPTEX $= l$.

Next pick the first node in the list $D$. If $D$ is empty, OPTEX is the optimal
extreme point. Otherwise this node represents extreme point $(k+1)$. Check if
$z_{k+1} + w_{k+1} <$ FOPT. If that is true, let OPTEX $= k+1$ and FOPT $=
z_{k+1} + w_{k+1}$, and delete from the list $D$ all nodes with $z_i \ge$ FOPT
$-$ WMIN.

Increase $k$ by one and repeat.

Clearly, a good approximation of WMIN is crucial for the speed of convergence.
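The role of WMIN in the deletion test can be seen in a simplified sketch. It assumes the extreme points are already available in order of increasing linear cost $z_i$ (the neighbor generation and the list $D$ are left out), and the function name and sample data are hypothetical.

```python
def best_fixed_charge_point(points, wmin=0.0):
    """Scan extreme points in order of increasing linear cost z_i.

    points -- iterable of (z_i, w_i, label), sorted by z_i, where w_i >= wmin
              is the sum of the fixed costs of the open plants.
    Returns (FOPT, OPTEX, number of points examined).
    """
    fopt, optex, examined = float("inf"), None, 0
    for z, w, label in points:
        if z >= fopt - wmin:   # z + w >= z + wmin >= fopt: cannot improve
            break              # the same holds for all later points
        examined += 1
        if z + w < fopt:
            fopt, optex = z + w, label
    return fopt, optex, examined


points = [(10, 18, "A"), (12, 10, "B"), (15, 8, "C"), (21, 8, "D"), (30, 8, "E")]
with_bound = best_fixed_charge_point(points, wmin=8)
without_bound = best_fixed_charge_point(points, wmin=0)
```

Both calls identify point B with total cost 22, but the call with WMIN = 8 examines two points while the call with WMIN = 0 examines four.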

With linear objective functions, one would always expect that the last cut
generated will be binding in the next iteration. With functions like ours, this
will in general not be the case. The following way of rewriting the relaxed
deterministic problem is therefore not valid.

$$\text{minimize} \quad \sum_j \tilde h_j(x_j) + \theta$$

$$\text{subject to} \quad V_i x + \theta \ge v_i, \quad i = 1,\ldots,r-1$$

$$V_r x + \theta = v_r$$

$$0 \le x \le d$$

The main disadvantage of this method so far is that we will expect $x$ to
change little from one step to the next. Therefore the optimal $x$ from the
previous step is likely to be a good guess. The problem is, however, that by
using this $x$ as a starting point (using a few dual simplex steps) we have no
stopping criterion, although the optimal solution might be just a few pivots
away. We therefore suggest the following approach.

Take the $x$ from the previous step. It will represent a primal infeasible but
dual feasible solution. Use a dual method to find a primal feasible solution.
The number of steps will probably be rather low. Denote this extreme point
by 0 (zero). Find $w_0$, $z_0$ and let FOPT $= z_0 + w_0$ and OPTEX $= 0$. Then
start the main procedure.
The advantage is now that provided extreme point 0 is a good guess (which
most likely is not the case for extreme point 1) the number of nodes needed to
find the optimal solution will decrease since the check for whether or not nodes
in the list $D$ can be deleted is likely to be more powerful. This idea has been
tried, and the results are outlined in Section 28.8.

It is possible for the number of cuts of the type (28.16) to become very
large. But due to a result by Murty [11] we only need to keep a maximum of
$M1 + M2 - 1$ (the number of rows in the node-arc incidence matrix).
Theoretically we can, therefore, in each step drop all nonbinding constraints.
Our experience, however, is that such an approach is extremely difficult, due to
the unstructured behavior of the quasi-concave objective function. Even dropping
only one hyperplane is difficult, since the hyperplane with the largest slack in
one iteration easily becomes binding in the next. We return to this in Section
28.8.

As an alternative to the extreme point enumeration we present another 
method which must be viewed as heuristic. 


28.4 A Heuristic Approach to Solve the Relaxed Program 


As we will outline further in a later section, the method described in the previous 
section for solving the relaxed deterministic problem is not very efficient for this 
specific problem. 

In this section we therefore present a heuristic approach based on
cardinality constraints, [8]. Very loosely, the idea can be expressed as follows:


Idea. The relaxed deterministic problem (28.19), (28.21) will for reasonable
values of $h_j$ and $H_j$ be unimodal in $k$, the number of plants.

We will return to the problem of determining when unimodality is present,
but assume so far that this is actually the case.

The advantage of this approach is that we can solve a relaxed deterministic
problem using the cost function $\sum_j h_j x_j$ instead of the much more
complicated $\sum_j \tilde h_j(x_j)$.

The algorithm, which is based on a series of LP's, can take on two different
forms depending on which of the following questions we ask:


- What is the best structure given that we use exactly $k$ plants?
- What is the best structure given that we use no more than $k$ plants?


The first of these questions can be answered by checking all possibilities
of $k$ plants, but always using an extra cut such that the LP-code only finds a
feasible solution if the current combination of $k$ plants is the best so far
(since it is faster to find infeasibility than optimality of a feasible
problem). The extra cut is made from the coefficients of the objective function.

The second question above can, provided $H_j = H$, be solved using a
cardinality constrained LP, see [8]. Both methods are exponential.
Provided we solve a series of LP's, there are basically two approaches:


- Solve the problem sequentially until we find a $k$ such that
  $\sum_j \tilde h_j(x_{k+1,j}) > \sum_j \tilde h_j(x_{kj})$, where $x_{kj}$ is
  the size of the $j$-th potential plant given that we have $k$ plants.

- Use a golden section search on $k$.


If the cardinality constrained LP is used, we can solve sequentially with the
extra constraint

$$N(x) \le k \tag{28.23}$$

where $N(x)$ is the number of open plants, until (28.23) is not binding. Then
the solution found in the previous step is optimal. Or we can do a bisection on
$k$. A bisection on $k$ will need a maximum of $\log_2(M2 - 1)$ steps.
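The difference between a sequential search and an interval-halving search on $k$ can be sketched as follows. Here `cost(k)` is a hypothetical oracle returning the optimal cost with exactly $k$ plants, assumed strictly unimodal in $k$; the halving variant compares neighboring values of $k$ and is a generic stand-in, not the binding test on (28.23) described above.

```python
def sequential_search(cost, k_max):
    """Increase k from 1 until the cost goes up; return the last k before that."""
    k = 1
    while k < k_max and cost(k + 1) <= cost(k):
        k += 1
    return k


def bisection_search(cost, k_max):
    """Halve the interval [1, k_max] by comparing cost(mid) and cost(mid + 1)."""
    lo, hi = 1, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if cost(mid) <= cost(mid + 1):
            hi = mid          # minimum is at mid or to the left
        else:
            lo = mid + 1      # cost is still decreasing: minimum is to the right
    return lo


def toy_cost(k):
    # Hypothetical unimodal cost with its minimum at k = 3.
    return (k - 3) ** 2 + 10


k_seq = sequential_search(toy_cost, 11)
k_bis = bisection_search(toy_cost, 11)
```

The sequential search needs about as many cost evaluations as the optimal $k$, which is attractive when few plants are optimal; the halving search needs on the order of $\log_2$ of the range regardless of where the minimum lies.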


28.5 Complexity of the Relaxed Equivalent Problem 


In this section we show that the relaxed deterministic problem (28.19),
(28.21) is NP-hard. For a detailed treatment of NP-problems, see [7]. Problem
(28.19), (28.21) is clearly equivalent to the following mixed zero-one integer
programming (IP) problem. Given $h$, $H$, $V$, $b$ and $D$ as nonnegative
rational matrices and vectors, find

$$\text{minimize} \quad hx + Hy + \theta$$

$$\text{subject to} \quad \left. \begin{array}{l} Vx + 1\theta \ge b \\ x \le Dy \\ y \in \{0,1\}^{M2-1} \\ x \in \mathbb{R}_+^{M2-1} \\ \theta \in \mathbb{R} \end{array} \right\} X$$

The recognition version of the above mixed zero-one IP, called (MIP), is

"Does $hx + Hy + \theta \le M$ and $(x, y, \theta) \in X$ have a solution?"

We will show the following theorem.


Theorem. (MIP) is NP-complete.


Proof: First we show that (MIP) is in the class NP. Any feasible solution to
(MIP) will have $y \in \{0,1\}^{M2-1}$. The remaining components are determined
by an LP in the original coefficients. Since LP is in the class P, which is a
subset of NP, the result follows.

We complete the proof by showing that there exists an NP-complete problem
that transforms polynomially into (MIP).

The zero-one IP

"Given $A$ and $b$, does $Ay \ge b$, $y \in \{0,1\}^{M2-1}$ have a solution?"

is NP-complete even if $A$ and $b$ are restricted to nonnegative entries. Let an
arbitrary instance of zero-one IP be given by $A$ and $b$. We show how to
construct in polynomial time an instance $J$ of (MIP) such that the zero-one IP
has a solution if and only if $J$ has a solution.

Let $h = H = 0$, $V = A$, $D = I$ and $M = 0$. It is immediate that if the
zero-one IP has the solution $y^*$ then $J$ has the solution
$\hat x = \hat y = y^*$ and $\hat\theta = 0$. Conversely, if $J$ has a solution
$\hat x$, $\hat y$, $\hat\theta$ then
$A\hat y \ge A\hat x \ge b - 1\hat\theta \ge b$. Hence $y^* = \hat y$ is a
solution to the zero-one IP. Q.E.D.

This result justifies the use of exponential algorithms to solve the relaxed
deterministic problem.


28.6 An Alternative Approach 


In the previous section we outlined a heuristic method for solving the relaxed 
deterministic problem. Based on the idea of unimodality another heuristic 
approach is reasonable to try. The idea is: 


Idea: The two-stage stochastic facility-location problem (28.1)-(28.5) is uni- 
modal in k, the number of plants. 
As for the relaxed deterministic problem we can again either perform: 


- bisection by adding (28.23)
- linear search by adding (28.23)
- golden search by adding

$$N(x) = k \tag{28.24}$$

- linear search by adding (28.24)


In each of these cases we will decompose (28.1)-(28.5) and (28.23) or 
(28.24) as explained in the previous sections into a relaxed deterministic prob- 
lem and a stochastic transportation problem. But now the relaxed deterministic 
problem only has to be solved for one value of k. 

Note that also here H; = H is necessary if the cardinality-constrained LP 
is to be used. 

By this method we have moved the iteration on k from an inner loop to an 
outer loop. It is not clear to us which approach is best. The approach in this 
section, however, has the advantage that for each k not found to be optimal we 
get information about the optimal structure for that specific value of k. (With 
(28.23) this is only true for k smaller than the optimal value.) 

We then turn to the problem of determining when the problem (28.1)-(28.5)
is unimodal in $k$, the number of plants.

If unimodality with respect to minimization is not present, we must have
a situation like Figure 28.1.

Figure 28.1 A situation violating unimodality with respect to minimization.
(Axes: total cost against number of plants, shown for $k-1$, $k$ and $k+1$.)

In the following we will assume that $H_j = H$ and $h_j = h$, i.e. all potential
plants have the same cost structure. Let $x_k$ be the optimal plant structure
with $k$ plants and $d_k$ the total expected transportation costs associated
with it.

Mathematically, lack of unimodality means that

$$kH + h\sum_j x_{kj} + d_k > (k-1)H + h\sum_j x_{k-1,j} + d_{k-1}$$

$$kH + h\sum_j x_{kj} + d_k > (k+1)H + h\sum_j x_{k+1,j} + d_{k+1}$$

We add to get

$$h\sum_j x_{kj} + d_k > \frac{1}{2}\Big[h\sum_j (x_{k-1,j} + x_{k+1,j}) + d_{k-1} + d_{k+1}\Big] \tag{28.25}$$

Following [2, p. 204], inequality (28.25) is the exact definition of concavity
over integers. Let $f_k = h\sum_j x_{kj} + d_k$. We then get


Proposition 3. The problem (28.1)-(28.5) is unimodal in $k$, the number of
plants, provided $f_k$ is strictly convex over the integers $[1, M2 - 1]$.


The condition in the above proposition is clearly not necessary. There are
certain concavity situations which are acceptable. We will not go into details
here, just note the following:

- If sequential search on $k$ is performed, we only need to require that the
  global minimum is the left-most local minimum.

- With bisection or golden search, concavity in $f_k$ can be acceptable, but
  (28.1)-(28.5) must be unimodal.
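Both conditions are easy to check numerically once the values $f_k$ are known. The sketch below assumes they are stored in a list with `f[0]` holding $f_1$; the function names and the sample values are hypothetical.

```python
def is_strictly_convex(f):
    """Check f_k < (f_{k-1} + f_{k+1}) / 2 for every interior k."""
    return all(2 * f[k] < f[k - 1] + f[k + 1] for k in range(1, len(f) - 1))


def is_unimodal(total):
    """True if no interior k is strictly worse than both of its neighbors,
    i.e. the situation of Figure 28.1 does not occur."""
    return not any(total[k] > total[k - 1] and total[k] > total[k + 1]
                   for k in range(1, len(total) - 1))


def total_cost(f, H):
    """Total cost k*H + f_k for k = 1, ..., len(f)."""
    return [(k + 1) * H + fk for k, fk in enumerate(f)]


f_convex = [30, 18, 12, 9, 8]     # decreasing with shrinking gains
f_bumpy = [30, 12, 14, 2]         # violates convexity at the third entry

convex_ok = is_strictly_convex(f_convex)
unimodal_ok = is_unimodal(total_cost(f_convex, H=5))
bumpy_convex = is_strictly_convex(f_bumpy)
bumpy_unimodal = is_unimodal(total_cost(f_bumpy, H=1))
```

Adding the linear term $kH$ does not change the second differences, which is why strict convexity of $f_k$ alone carries over to the total cost.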


28.7 Example 
The purpose of this example is to investigate the following question: 


“If the Norwegian fish meal and fish oil industry were to be estab- 
lished today, what would the plant structure in southern Norway be, 
provided we assume there are enough vessels available?” 


The reason for asking such a question is the structure of today’s industry. 
All the plants we have today are both small and old. Compared to e.g. Den- 
mark, even our largest plant is small, and the largest Norwegian plant (which 
is in northern Norway) is almost twice as large as the second largest plant. 

Since many of the existing plants are very old, the action of building a 
new plant (of the size suggested in this report) will not be very different from 
rebuilding one of the old plants. 

What is wrong with this approach, however, is that in the short run the 
fixed cost of an existing plant is lower than assumed here since the alternative 
value of it is most often close to zero. But in the long run, one will not reinvest 
the amount needed to maintain the plant unless the profit is as good as else- 
where. Therefore the approach can be considered appropriate at least in the 
long run. 
Fishing grounds


We have assumed 5 different fisheries, taking place at 14 fishing grounds. Po- 
sition, quotas and fishing seasons are based on the situation over the last few 
years. The 5 fisheries are given in the table below. 


Table 28.1 Expected values and standard deviation for quotas, fishing seasons
and positions for the 5 fisheries used in this article. Quotas are measured in
hectoliters.

Fishery                           Expected quota
Mackerel (Scomber scombrus)       250.000
Blue whiting (Gadus poutassou)    410.000   820.000   410.000   410.000
Sprat (Clupea sprattus)           170.000   400.000
Sand-eel                          75.000   225.000   75.000   125.000
                                  80.000   1200.000   320.000

                                  75.000   100.000

Position (+ = East)
59.4   -18.0
59.4   -10.2
60.0    -4.0
61.5    -1.0
Fish meal plants


We have assumed 11 potential plants along the coast of western Norway. The
table below shows their positions.


Table 28.2 Region and position for 11 potential plants of southern Norway.

Region                    Position
Sunnmøre                  5.0
Nordfjord                 5.0
Sunnfjord                 5.2
Ytre-Sogn                 5.0
Nordhordaland             5.0
Bergen                    5.3
Sunnhordland              5.1
Nordrogaland              5.3
Stavanger                 5.7
Flekkefjord/Egersund      6.3
Lindesnes                 7.5


Fixed costs for the plants
The fixed costs for a plant consist in general of two parts:


(1) The cost related to maintenance in order to keep the plant as new. This 
should include what is needed to update the equipment technologically. 
(2) Alternative cost for the capital bound in the plant. 


(2) will be different depending on whether we consider an old plant or will 
build a new one. If there is no alternative use of an old plant, the alternative 
cost will be zero. 

In this report we only consider building new plants, i.e. we consider the
problem: What would we do if we were to establish the Norwegian fish-meal
industry today? Based on data presented in [15] we have found the following
linear approximation of the sum of (1) and (2) above.

$$\mathrm{FIX} = 1.84 + 0.18x$$

where $x$ is the capacity measured in number of loads (each 5000 hl) per month.
The cost is measured in millions of NOK.
Transportation costs


The transportation costs per load are calculated between each pair of fishing 
grounds and plants. The vessels’ use of fuel is found according to formulas 
presented in [8] on the basis of a chosen speed and vessel size. 

The cost of bringing one load to the demand point representing the recourse 
action of going to Denmark is calculated as follows: 


1. As pure transportation cost, use the most expensive within the country,
   i.e. $\max_{i,j} c_{ij}$.
2. Add a fixed cost per hl. This cost is meant to reflect the loss due to the
   fact that Denmark gets the profit from processing and selling the fish.


We have used data from a simulation reported in [16], but we have raised 
the world market prices for fish meal and fish oil to NOK 3.30/kg and NOK 
2.71/kg, respectively. Furthermore, we have set the alternative value of labor 
to zero, to reflect that the alternative jobs do not exist in rural Norway. This 
gives us a loss of NOK 32 per hl when the fish is sent to Denmark. 


Calculating the quotas 


The quota for a certain fishery at a given position is found as follows.

First a total quota is found from a normal distribution with expectation
and variance as given in Table 28.1. If the quota is smaller than
$\mu - 2\sigma$, it is set equal to this value. If the quota is larger than
$\mu + 2\sigma$, it is set equal to that value.

The quota is in the input distributed over a set of time periods. $a_{it} = 0.5$
means that 50 percent of the quota of fishery $i$ will be caught in time period
$t$, i.e. we expect to catch $\mu_i a_{it}$. The amount allocated to fishery $i$
in time period $t$ is then drawn from a normal distribution with expectation
$\mu_i a_{it}$ and variance $(b_{it} a_{it} \mu_i)^2$, again avoiding outliers
as above. Note that this value is independent of the actual quota found above.
$b_{it}$ gives the standard deviation as a fraction of $a_{it}\mu_i$.

In this way we might experience both fisheries that finish before planned
and fisheries that run out of time.

This process is repeated a number of times to obtain several right hand
sides.
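The sampling scheme can be sketched as follows. The function names, the seed and all numerical values are hypothetical (only the truncation at two standard deviations and the roles of the shares and relative standard deviations are taken from the text); how the total quota and the period catches are combined into a right hand side is left open here.

```python
import random


def truncated_normal(mu, sigma, rng):
    """Draw from N(mu, sigma^2) and move outliers to mu +/- 2 sigma."""
    value = rng.gauss(mu, sigma)
    return min(max(value, mu - 2 * sigma), mu + 2 * sigma)


def draw_right_hand_side(mu, sigma, shares, rel_std, rng):
    """One realization for a single fishery.

    shares[t]  -- a_it, expected share of the quota caught in period t
    rel_std[t] -- b_it, standard deviation as a fraction of a_it * mu
    """
    total_quota = truncated_normal(mu, sigma, rng)
    # Period catches are drawn independently of the total quota drawn above.
    catches = [truncated_normal(a * mu, b * a * mu, rng)
               for a, b in zip(shares, rel_std)]
    return total_quota, catches


rng = random.Random(0)   # fixed seed so that the draws are reproducible
samples = [draw_right_hand_side(250000.0, 50000.0,
                                [0.5, 0.3, 0.2], [0.1, 0.1, 0.2], rng)
           for _ in range(5)]
```

Because the period catches are drawn independently of the total quota, their sum can fall short of or exceed it, which corresponds to fisheries that run out of time or finish early.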
Results


With the data given, the solution turns out to be that only one plant should 
be constructed. The formal solution is to have one plant in the Stavanger area. 
Clearly this solution, being robust toward changes in quotas, might be nonro- 
bust toward changed positions of the fisheries. We should therefore conclude as 
follows. 

One plant with a capacity of 46.500 hl/day should be built somewhere 
between Haugesund and Egersund. 

A plant of this size will be almost 3 times as large as the largest existing 
plant in Norway. 

We would again like to point out that this is a long-run result. In the short
run the alternative value of an existing plant is much lower than the 7 percent of
the "new value" we have used, so in the short run many of the existing plants
should be kept. In the long run, however, one will not reinvest in these old
plants (since there are better alternatives). The result above, given the input,
is therefore the long run goal, which is stable toward changes in quotas.

It must be admitted that this result is rather surprising. Below we stress 
some shortcomings of the model, and show in which direction they would move 
the solution. 


(a) Aspects strengthening the one-plant solution

- We have not assumed any economy of scale in the variable part of the
  fixed cost. Hence if we let

  $$\tilde h_j(x_j) = H_j + h_j(x_j)$$

  where $h_j(x_j)$ is concave, the tendency towards one plant would be
  strengthened.

- We have not been able to model the fact that continuous production
  is advantageous (as reported in [15]), and that a low number of plants
  means a high level of continuity.

- Changes in positions of fishing grounds could be such that the spread
  decreases, strengthening the one-plant solution.


(b) Aspects weakening the one-plant solution

- The fishing grounds could be more spread than assumed.

- The alternative value of labor could differ between the potential sites
  of the plants, making it cheap to establish several plants at low cost
  sites. (But still we easily get a one-plant solution at a low cost site.)

- The $H_j$'s can be overestimated.
28.8 Computational Experiences


We shall start by reporting our experience with the method used to solve the
subproblem, i.e. the stochastic transportation model. The method used is, as
mentioned, outlined in [17], where we reported good computational experiences.
The idea is to decompose the requirement space of the transportation problem 
into a set of polyhedral cones. These cones represent all possible optimal bases 
(whichever right hand side is used) and they are all dual feasible. Going from 
one cone to its neighbor is equivalent to taking a dual step of the simplex 
algorithm. 

As long as the problem at hand is relatively small, all the cones can be 
generated, and the method is extremely efficient. If, however, the problem 
is large (as ours with 15 supply points and 12 demand points) the number 
of cones is so huge that one can only create some of them. With unimodal 
distributions, our clear advice in [17] was to create cones in “circles” around 
the one containing the expected values of the uncertain right hand sides, since 
these cones in some sense are large. The example treated in this report, however, 
does not subsume this unimodality condition. Therefore, although we created 
4000 cones, only a few percent of the right hand sides fell into these cones (with 
an optimal solution with a larger number of plants, the number could probably 
have been better, since “the expected value of the right hand side” was assumed 
to be “all plants open”). 

Each of the 996 transportation problems solved in the subproblem took 
approximately 0.4 second CPU time, which is reasonable for a 15 x 12 system. 
This was despite the fact that for almost all the right hand sides the dual method 
used to find the optimal solution if none of the 4000 cones were optimal, had 
to be called. Most of the time was used in this subroutine. 

Note that this does not mean that we could as well have dropped the dual 
decomposition. The cones still represent a set of very efficient dual steps. 

There are two important questions when solving the relaxed deterministic 
problem. One is which method to use once the problem is established. The 
other one, which to us seems to be extremely important and difficult, is which 
hyperplanes to drop. Despite the result of Murty [11], dropping all nonbinding 
hyperplanes is not practical at all. On the other hand, one must limit the 
number of hyperplanes to keep a manageable problem. The reason for the 
problem is the unstructured behavior of the objective function. The example 
below shows how the extreme points can be ordered for a very simple example. 


Figure 28.2. Ordering of extreme points when $H_1 = 10$, $H_2 = 8$, $h_1 = 1$
and $h_2 = 2$.

If we solve this problem, we find the optimal solution $x_1 = 10$, $x_2 = 0$. A
cut is therefore created, forcing (10,0) out of the feasible region. If a hyperplane
is to be dropped on the basis of the largest slack, we will drop the plane going
through (0,8) and (1,4). The next optimization will then bring us to (0,5). We
therefore see that the hyperplane we dropped was the only one that would make
the objective function decrease instead of increase. The hyperplane through
(0,8) will again be added, and we must drop another hyperplane. With the
given rule, the newly created hyperplane that removed (10,0) from the feasible
region will be dropped. We have entered a situation where we alternate between
(0,5) and (10,0).

Measures can be taken to avoid this lack of convergence, e.g., redoing the 
dropping if the objective function does not increase. The example, however, 
very well illustrates the problems inherent in this kind of objective function. 

Very closely related to the problem of dropping hyperplanes is which method 
to use to solve the relaxed deterministic problem. The reason is that the two 
methods we have outlined (extreme point enumeration and solving a series of 
LP’s) react differently with respect to increases in the number of constraints. 

Instead of giving a general description of these methods, we will only outline 
our experience as a result of the example in the previous section, where the 
optimal number of plants was low. 
The problem had 13 variables and 13 slacks. When employing the extreme
point enumeration technique we tried to drop hyperplanes when the number of 
them exceeded 13. It turned out that with these 26 variables and 13 constraints 
it took literally hours to solve one iteration. This is not mainly due to the 
complexity of extreme point enumeration, but because all hyperplanes were 
almost parallel as the example in Figure 28.3 shows. 


Figure 28.3. Example showing how the hyperplanes tend to become almost 
parallel as the iteration proceeds. 


Therefore the costs $\sum_j h_j x_j$, on which the enumeration is based, are almost
the same in the vast majority of extreme points. Thus the procedure that 
deletes extreme points from the list of extreme points to be examined is almost 
without any power, i.e. we tend to examine almost all extreme points in the 
polyhedron. 

Note that the problems outlined above do not exclude extreme point enumeration
methods when the polyhedron has a more normal form, because then
the deletion of extreme points from the list is likely to be much more powerful.

It is reasonable to believe that the enumeration would have worked better
if the optimal number of plants had been higher, since that would have tended
to produce fewer nearly parallel hyperplanes.

The method improved a little when we started by defining a node 0 in 
order to strengthen the deletion procedure, but not very much, although we 
put some effort into getting a good node 0. 

We also tested the method based on unimodality. Since the number of 
plants in the solution was low, this method was very efficient. It converged 
very fast even when we let the number of constraints increase to 26. With 
26 constraints the main iteration converged, so we did not have to drop any 
hyperplanes. 

If the optimal number of plants had been around M2—1 this method would 
clearly not be very efficient since an exponential number of LP’s would have 
had to be solved in each main iteration. 

We have not tested Holm’s cardinality constrained method [8], but it should
be tried since it is likely to be quite efficient here even though it is also an
exponential method.

We have also not yet examined the possibility of using an outer iteration 
scheme on k, the number of plants. 


CHAPTER 29


SOME TEST PROBLEMS FOR STOCHASTIC 
NONLINEAR MULTISTAGE PROGRAMS 


X. de Groote, M.C. Noël and Y. Smeers


29.1 Introduction 


Few algorithms exist for handling multistage nonlinear programming problems 
with recourse. It is thus reasonable to provide, at this stage, test problems
that can be handled without sophisticated implementations but still offer a 
sufficiently broad range of complexity. 

We propose in this paper different economic growth models that can be 
used for testing algorithms for nonlinear multistage programming models. All 
problems are variations of the nonlinear part of Manne’s energy economy model 
ETA-MACRO ([3], [38]). 

This set of test problems offers the following advantages: 


(i) The models are quite simple in terms of rows and variables in each period 
and for each event. They have been benchmarked for the case of the 
European Community and thus, provide in some sense a set of (very much 
related) realistic problems. 

(ii) The models are ranked in order of increasing complexity. This is
meant not only in terms of the number of rows and equations, number of
periods or number of events, but also with respect to the nonlinearities 
that they contain and the modeling of the recourse that they imply. 

(iii) The data required by the models are reduced to a minimum. We provide 
in this paper all details necessary for setting up the problems. 


The paper is organized as follows. Section 2 describes the deterministic ver- 
sions of the models. Section 3 provides the required data, analyzes the numerical 
behavior of the different deterministic models and introduces the construction
of the stochastic versions of the problems. Section 4 is more specific to our 
implementation in the sense that it gives the dual of the different models; this 
could be relevant for other algorithms that use primal and dual information. 
Finally, the numerical results are discussed in the last section. 

29.2 The Test Problems: Deterministic Forms


The problems considered in this paper have been constructed from the energy- 
economy model ETA-MACRO developed by Manne and his coauthors since 
1977 ([1]; see [2] and [3] for more recent developments). ETA-MACRO assumes 
a two sector representation of the economy. The energy sector is described by 
process analysis while the rest of the economy is represented by a production 
function. We consider two simplified versions of ETA-MACRO; in the first 
one, noted B (basic), the representation of the energy sector is reduced to the
production of electric and nonelectric energy, each of them by a single activity. 
The second simplified version of the model, noted E (electricity), recognizes 
both a capital and operations variable for the production of electricity (the 
production of nonelectric energy being still represented by a single operations 
variable). Besides the fact that they lead to models with different numbers of
constraints and variables, these problems also present variations of formulation 
that are interesting from the point of view of stochastic programming. In 
particular, the long construction time assumed for the stock of capital in power 
generation reduces the recourse possibilities of the energy sectors. 

ETA-MACRO is formulated as a putty-clay model; perfect malleability is 
assumed for the new capital stock while the production structure of the old 
capital stock is fixed. An alternative approach, which is less realistic, is to 
suppose perfect malleability of the whole capital stock. The distinction, which 
is quite important from the point of view of economic modeling is also relevant 
in the context of stochastic programming where the putty-clay model offers less 
recourse than the putty-putty one. Each of the two models B and E will be
considered in a putty-clay (PC) and a putty-putty (PP) version. We thus present
a total of four models, which all deal with the same system but correspond to
different degrees of realism, number of constraints and variables, and numerical
difficulties.

We now describe these models in more detail.

Model A (Basic Putty-Clay)

The output of the economy in period $t$, $Y_t$, is decomposed into a contribution
due to the existing capital stock and an additional $YN_t$ due to the capacity
becoming available in $t$. Following ETA-MACRO, the contribution $YN_t$ is
constructed as

\[
YN_t = \left[ a \left( EN_t^{\beta} NN_t^{1-\beta} \right)^{\rho} + b \left( KN_t^{\alpha} LN_t^{1-\alpha} \right)^{\rho} \right]^{1/\rho} , \tag{A.1}
\]

where $EN_t$, $NN_t$ and $LN_t$ are the inputs in electric energy, nonelectric energy
and labor consumed by the new capital stock $KN_t$.

The total output of the economy and its consumption of capital, labor,
electric and nonelectric energy are then given by the relations:

\begin{align}
Y_t - \lambda Y_{t-1} - YN_t &= 0 \tag{A.2} \\
K_t - \lambda K_{t-1} - KN_t &= 0 \tag{A.3} \\
E_t - \lambda E_{t-1} - EN_t &= 0 \tag{A.4} \\
N_t - \lambda N_{t-1} - NN_t &= 0 \tag{A.5}
\end{align}

where $\lambda$ is the decay rate of the existing capital stock over one period (usually
several years).
These equations are written for all $t$ from 2 to the end of the horizon $T$.
In order to link $KN_t$ to the investments, we introduce the additional relations
for $t = 2, \ldots, T$,

\[
K_t - \lambda K_{t-1} - \alpha I_t - \beta I_{t-1} = 0 \tag{A.6}
\]

where $I_t$ and $I_{t-1}$ are the investments in the current and preceding period; $\alpha$
and $\beta$ are their respective contributions to the capital stock of period $t$.
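
The accumulation relation (A.6) can be checked against the reported solutions with a short sketch. The Python below is only an illustration: it assumes that the period decay factor is the annual rate of Table 29.3 compounded over the five years of a period (the text does not state this explicitly), and the function name `next_capital` is introduced here for convenience.

```python
# Capital accumulation (A.6): K_t = lam * K_{t-1} + alpha * I_t + beta * I_{t-1}

ANNUAL_DECAY = 0.96      # annual decay rate of the capital stock (Table 29.3)
YEARS_PER_PERIOD = 5     # the model is operated on five-year periods
ALPHA, BETA = 3.0, 2.0   # accumulation coefficients (Table 29.3)

# Assumption: the period factor is the annual factor compounded over 5 years.
LAM = ANNUAL_DECAY ** YEARS_PER_PERIOD


def next_capital(k_prev, i_curr, i_prev, lam=LAM, alpha=ALPHA, beta=BETA):
    """Return K_t implied by (A.6) when the relation holds with equality."""
    return lam * k_prev + alpha * i_curr + beta * i_prev


if __name__ == "__main__":
    # K and I at node 1, and I at node 2, as reported in Table 29.16.
    k2 = next_capital(k_prev=5982.0, i_curr=205.7, i_prev=340.0)
    print(f"K at node 2 implied by (A.6): {k2:.1f}")
```

Under these assumptions the implied capital stock is within rounding of the value of $K$ listed for node 2 in Table 29.16, which suggests that the compounding assumption is a reasonable reading of the data.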

As in ETA-MACRO, the global output of the economy is allocated to private
consumption, investments and the input of electric and nonelectric energy; this
is expressed as:

\[
C_t + I_t + pe_t E_t + pn_t N_t - Y_t = 0 \tag{A.7}
\]

where $pe_t$ and $pn_t$ are respectively the unitary costs of the electric and
nonelectric energy inputs in period $t$.

In contrast with ETA-MACRO, we do not disaggregate the expressions
$pe_t E_t$ and $pn_t N_t$ into their components as in an energy model. Needless to say
this could be done later in order to investigate the behavior of larger stochastic
models; such an extension would however go beyond the scope of this paper.

We conclude the description of this first model by giving the objective
function and the terminal condition. As in ETA-MACRO, it is assumed that
the system is driven by a multitemporal utility function:

\[
\sum_{t=1}^{T-1} \rho^{t} \log C_t + \frac{\rho^{T}}{1-\rho} \log C_T \tag{A.8}
\]

where the first $T - 1$ terms deal with private consumption during the beginning
of the horizon; the last term accounts for end effects as discussed below.

End effects are dealt with by assuming that the economy is growing at a
rate $g$ after the horizon; investments in period $T$ must be sufficient to guarantee
this growth of the stock of capital after accounting for equipment decay. This
is expressed by the relation:

\[
-I_T + (1 + g - \lambda^{A}) K_T = 0 \tag{A.9}
\]

where $\lambda^{A}$ is the annual decay rate.

The model is operated on a horizon decomposed in five year periods. Data
for the European Community have been adapted from Rogner et al. ([5]) and
are given in the next section with a discussion of the initial conditions.

The second model (B - PP) supposes a putty-putty description of the
economy; the capital stock is homogeneous and perfectly malleable in each
period. This eliminates the need for distinguishing between new and old capital
stock. The model is then written as follows: the global output of the economy
$Y_t$ is given as

\[
Y_t = \left[ a \left( E_t^{\beta} N_t^{1-\beta} \right)^{\rho} + b \left( K_t^{\alpha} L_t^{1-\alpha} \right)^{\rho} \right]^{1/\rho} . \tag{B.1}
\]

Because the capital stock is completely malleable, we only need to describe its
accumulation through time; this is done in the constraint:

\[
K_t - \lambda K_{t-1} - \alpha I_t - \beta I_{t-1} = 0 . \tag{B.2}
\]

The total output of the economy is similarly allocated between private
consumption, investments and input for energy; this leads to an equation identical
to (A.7) that we note (B.3); the objective function (A.8) and terminal condition
(A.9) are similarly unchanged and become the objective function (B.4) and
terminal condition (B.5) of the new model. This is summarized below:

\begin{align}
& C_t + I_t + pe_t E_t + pn_t N_t - Y_t = 0 \tag{B.3} \\
& \sum_{t=1}^{T-1} \rho^{t} \log C_t + \frac{\rho^{T}}{1-\rho} \log C_T \tag{B.4} \\
& -I_T + (1 + g - \lambda^{A}) K_T = 0 . \tag{B.5}
\end{align}


As will be discussed later, it is interesting to consider a version of the model
where the power generation plants are represented with their construction lead
time. We accordingly introduce a new capital stock for power generation and
disaggregate the input of electric energy into its fuel and investment components.
Taking up the putty-clay model first (E - PC), we maintain equation (A.1) that
gives the contribution of the new capital stock $KN_t$ to the gross output of the
economy:

\[
YN_t = \left[ a \left( EN_t^{\beta} NN_t^{1-\beta} \right)^{\rho} + b \left( KN_t^{\alpha} LN_t^{1-\alpha} \right)^{\rho} \right]^{1/\rho} . \tag{C.1}
\]

The total output of the economy and its consumption of capital, labor, electric
and nonelectric energy are given as in model A:

\begin{align}
Y_t - \lambda Y_{t-1} - YN_t &= 0 \tag{C.2} \\
K_t - \lambda K_{t-1} - KN_t &= 0 \tag{C.3} \\
E_t - \lambda E_{t-1} - EN_t &= 0 \tag{C.4} \\
N_t - \lambda N_{t-1} - NN_t &= 0 \tag{C.5} \\
K_t - \lambda K_{t-1} - \alpha I_t - \beta I_{t-1} &= 0 . \tag{C.6}
\end{align}


New relations are introduced however to describe the evolution of the power
generation system and the production of electricity:

\begin{align}
KE_t - \lambda_E KE_{t-1} - \alpha_E IE_t - \beta_E IE_{t-1} &= 0, & t &= 2, \ldots, T \tag{C.7} \\
E_t - d_E KE_t &\le 0, & t &= 2, \ldots, T. \tag{C.8}
\end{align}

The first relation describes the accumulation of the capital stock in the power
generation sector (using a particular decay factor $\lambda_E$ over the period) while
(C.8) relates the production of electric energy to the installed capacity through
a utilization rate $d_E$.

The allocation of the gross output of the economy is somewhat modified
in order to account for the new representation of the power sector:

\[
C_t + I_t + ce_t IE_t + pe_t E_t + pn_t N_t - Y_t = 0 \tag{C.9}
\]

where $ce_t$ is the cost, in monetary units of the reference year, of a unitary
investment in the power generation sector; $pe_t$ is the average fuel cost of the
power generation sector.

The objective function and the terminal condition for the nonelectric capital
stock are identical to those of model A:

\begin{align}
& \sum_{t=1}^{T-1} \rho^{t} \log C_t + \frac{\rho^{T}}{1-\rho} \log C_T \tag{C.10} \\
& -I_T + (1 + g - \lambda^{A}) K_T = 0 . \tag{C.11}
\end{align}

We introduce the last terminal condition for the capital stock of the electricity
sector:

\[
-IE_T + (1 + g - \lambda_E^{A}) KE_T = 0 \tag{C.12}
\]

where $\lambda_E^{A}$ is the annual decay factor of the power sector.

MODEL D (E-PP)


The last model is the putty-putty version of model C. The assumptions underlying
its construction have already been discussed in the context of model B.
We only list its relations.

The global output of the economy is given by relation (B.1) and the evolution
of the nonelectricity part of the capital stock by the expression (B.2):

\begin{align}
Y_t &= \left[ a \left( E_t^{\beta} N_t^{1-\beta} \right)^{\rho} + b \left( K_t^{\alpha} L_t^{1-\alpha} \right)^{\rho} \right]^{1/\rho} \tag{D.1} \\
K_t &- \lambda K_{t-1} - \alpha I_t - \beta I_{t-1} = 0 . \tag{D.2}
\end{align}

The accumulation of the power generation capacity and the relation between
the existing capacity and the production of electric energy are given by:

\begin{align}
KE_t - \lambda_E KE_{t-1} - \alpha_E IE_t - \beta_E IE_{t-1} &= 0, & t &= 2, \ldots, T \tag{D.3} \\
E_t - d_E KE_t &\le 0, & t &= 2, \ldots, T. \tag{D.4}
\end{align}

The rest of the model consists of the allocation of the gross output of the
economy, the objective function and the end-effect conditions. Those are the
same as in model C:

\begin{align}
& C_t + I_t + ce_t IE_t + pe_t E_t + pn_t N_t - Y_t = 0 \tag{D.5} \\
& \sum_{t=1}^{T-1} \rho^{t} \log C_t + \frac{\rho^{T}}{1-\rho} \log C_T \tag{D.6} \\
& -I_T + (1 + g - \lambda^{A}) K_T = 0 \tag{D.7} \\
& -IE_T + (1 + g - \lambda_E^{A}) KE_T = 0 \tag{D.8}
\end{align}


In order to ease the manipulation of these models, the different relations are 
listed in Table 29.1 with an indication of the model where they appear. 
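
Table 29.1 Equations and their use in the different models

Equations                                                                   Use in the different models

$YN_t = [a (EN_t^{\beta} NN_t^{1-\beta})^{\rho} + b (KN_t^{\alpha} LN_t^{1-\alpha})^{\rho}]^{1/\rho}$      A.1   C.1
$Y_t - \lambda Y_{t-1} - YN_t = 0$                                          A.2   C.2
$K_t - \lambda K_{t-1} - KN_t = 0$                                          A.3   C.3
$E_t - \lambda E_{t-1} - EN_t = 0$                                          A.4   C.4
$N_t - \lambda N_{t-1} - NN_t = 0$                                          A.5   C.5
$K_t - \lambda K_{t-1} - \alpha I_t - \beta I_{t-1} = 0$                    A.6   B.2   C.6   D.2
$C_t + I_t + pe_t E_t + pn_t N_t - Y_t = 0$                                 A.7   B.3
$-I_T + (1 + g - \lambda^{A}) K_T = 0$                                      A.9   B.5   C.11  D.7
$KE_t - \lambda_E KE_{t-1} - \alpha_E IE_t - \beta_E IE_{t-1} = 0$          C.7   D.3
$E_t - d_E KE_t \le 0$                                                      C.8   D.4
$C_t + I_t + ce_t IE_t + pe_t E_t + pn_t N_t - Y_t = 0$                     C.9   D.5
$-IE_T + (1 + g - \lambda_E^{A}) KE_T = 0$                                  C.12  D.8
$Y_t = [a (E_t^{\beta} N_t^{1-\beta})^{\rho} + b (K_t^{\alpha} L_t^{1-\alpha})^{\rho}]^{1/\rho}$            B.1   D.1
Objective: $\sum_{t=1}^{T-1} \rho^{t} \log C_t + \frac{\rho^{T}}{1-\rho} \log C_T$      A.8   B.4   C.10  D.6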

29.3 The Test Problems


Different stochastic programs can be constructed on the basis of the models 
of section 2. Our test problems arise from considering the following economic 
situation. In the present oil glut, it is expected that energy prices will remain 
weak for some time, with the possible consequence that exploration activity may 
decrease in the near future; this could lead to a renewed dependence on OPEC 
with possibly a new tightening of the market in the mid-nineties. Our test 
problems are attempts to formalize the question of whether the economy should 
adapt right now to possible price increases in the future or should wait until they 
occur. In order to model that problem we shall assume that the evolution of the 
oil prices over the horizon is random and that it can be represented in extensive 
form by a binary tree. The two branches originating from a node respectively 
correspond to high and low price increases during the period. The tree is 
rooted in year 1980 and extends over 5, 7 or 9 five-year periods, depending on 
the problem. The price growths are 0% and 4%, occurring with equal probability.
This corresponds to a median growth rate of 2% (IEW, [7]). This evolution is the
only random element of the model; all other factors are supposed to be perfectly 
known; they can be described as follows. 

The initial values of the capital stock (electric $KE_1$ and nonelectric $K_1$)
are given as well as the consumption of electric ($E_1$) and nonelectric ($N_1$) energy
and the gross output of the economy ($Y_1$) in the first period. Also known are
the initial investments $I_1$ and $IE_1$. All these are given in Table 29.2 with the
price and cost assumptions.

The values of the other coefficients result from plain assumptions or from 
equilibrium conditions. The evolution of labor ($L$ and $LN$) is exogenous in each
period; its growth rate is given in Table 29.3 with various parameters appearing 
in the production function and the constraints. The values have been selected 
from ([4]), ([5]) and ([6]). The remaining coefficients in the model are obtained 
from a benchmark at some equilibrium year. Consider the putty-putty model
first.

The first order condition,

\[
\frac{\partial Y}{\partial N} = Y^{1-\rho} E^{\rho\beta} N^{\rho(1-\beta)-1} (1-\beta)\, a = pn ,
\]

allows one to compute the coefficient $a$ if all other values are known. $b$ can then
be derived by difference from:

\[
b = \frac{Y^{\rho} - a \left( E^{\beta} N^{1-\beta} \right)^{\rho}}{K^{\alpha\rho}} ,
\]

where labor does not appear because $L = 1$ in the reference year (see Table 29.4).

These calculations were done for the year 1978 and the results are summarized
in Table 29.4. The coefficients derived for a putty-putty production function
remain valid under a putty-clay assumption, if we assume that the consumption
of the different factors is increasing at the same rate throughout.

No other numerical information is needed to specify the problems.


Table 29.2 Initial values: base year 1980

Item                                        Value      Unit

Capital stock                                          10^9 $75
Electric energy                                        10^12 KWH
Nonelectric energy                                     quads
Gross output                                           10^9 $75
Price of electric energy                    25.        10^9 $75/10^12 KWH
Price of nonelectric energy                 3.5        10^9 $75/quad
Labor                                       1.07743    see Table 29.4
Fuel cost of power generation sector        9.7945     10^9 $75/10^12 KWH
Cost of unitary investment in power
generation sector                           .630853    10^9 $75/10^6 KW
Investment in power generation sector                  10^6 KW
Capital stock in power generation sector               10^6 KW

Table 29.3 Main coefficients of the model

Electricity value share:                                  $\beta = .39$
Capital value share:                                      $\alpha = .33$
Elasticity of substitution:                               $\sigma = .388$
Derived value of $\rho$ (production function):            $\rho = (\sigma - 1)/\sigma = -1.58$
Annual decay rate of the capital stock:                   $\lambda^{A} = .96$
Annual decay rate in the power sector:                    $\lambda_E^{A} = .967$
Coefficients describing the accumulation
of capital:                                               $\alpha = 3$; $\beta = 2$
Coefficients describing the accumulation
of capital in the power sector:                           $\alpha_E = 0$; $\beta_E = 5$
Social discount rate:                                     $\delta = .06$
Derived value of $\rho$ (utility function):               $\rho = \left( \frac{1}{1+\delta} \right)^{5} = .747$
Capacity utilization in the power sector:                 $d_E = .0052$ (in GWH/KW)

Growth of labor:

period               1     2     3     4     5     6     7     8 and others
annual growth rate   3.4   3     3     2.7   2.7   2.5   2.5   2
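
The two derived values of $\rho$ in Table 29.3 follow from the elasticity of substitution and from the social discount rate. The sketch below reproduces them; the function names are introduced here for illustration, and the five-year compounding reflects the five-year periods on which the model is operated.

```python
SIGMA = 0.388            # elasticity of substitution (Table 29.3)
DELTA = 0.06             # social discount rate (Table 29.3)
YEARS_PER_PERIOD = 5     # five-year periods


def ces_exponent(sigma):
    """Exponent rho of the production function, rho = (sigma - 1) / sigma."""
    return (sigma - 1.0) / sigma


def period_discount_factor(delta, years=YEARS_PER_PERIOD):
    """Utility discount factor over one period, (1 / (1 + delta)) ** years."""
    return (1.0 / (1.0 + delta)) ** years


if __name__ == "__main__":
    print(f"production exponent: {ces_exponent(SIGMA):.2f}")
    print(f"discount factor:     {period_discount_factor(DELTA):.3f}")
```

Note that the same symbol $\rho$ denotes two different quantities in the table: a negative exponent in the production function and a discount factor below one in the utility function.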

Table 29.4 Evaluation of the coefficients a and b


Data for the reference equilibrium year 1978.

Y  = 1553    10^9 $75
E  = 1.02    10^12 KWH
K  = 5668    10^9 $75
N  = 21.63   quads
L  = 1.0     (by definition)
pn = 1.53    10^9 $75/quad


Values of a and b

a = .63 × 10^-5
b = .80 × 10^-3

Value of $L_{80}$

\[
L_{80} = \left\{ \left[ \frac{Y_{80}^{\rho} - a \left( E_{80}^{\beta} N_{80}^{1-\beta} \right)^{\rho}}{b\, K_{80}^{\alpha\rho}} \right]^{1/\rho} \right\}^{1/(1-\alpha)} = 1.07743 .
\]
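
The benchmark calculation can be retraced numerically. The following Python sketch is an illustration only: it assumes the production function has the nested form of relation (B.1), with $a$ weighting the energy aggregate and $b$ the capital-labor aggregate, and it uses the rounded parameter values of Table 29.3, so the resulting coefficients agree with the tabulated ones only up to rounding. The function names are hypothetical.

```python
ALPHA, BETA, RHO = 0.33, 0.39, -1.58   # rounded values from Table 29.3


def production(k, e, n, labor, a, b, alpha=ALPHA, beta=BETA, rho=RHO):
    """Nested production function assumed to have the form of (B.1)."""
    energy = (e ** beta) * (n ** (1.0 - beta))
    capital_labor = (k ** alpha) * (labor ** (1.0 - alpha))
    return (a * energy ** rho + b * capital_labor ** rho) ** (1.0 / rho)


def calibrate(y, k, e, n, labor, pn, alpha=ALPHA, beta=BETA, rho=RHO):
    """Compute a from dY/dN = pn, then b by difference."""
    a = pn / ((1.0 - beta) * y ** (1.0 - rho)
              * e ** (rho * beta) * n ** (rho * (1.0 - beta) - 1.0))
    energy = (e ** beta) * (n ** (1.0 - beta))
    capital_labor = (k ** alpha) * (labor ** (1.0 - alpha))
    b = (y ** rho - a * energy ** rho) / capital_labor ** rho
    return a, b


if __name__ == "__main__":
    # Reference year data from Table 29.4.
    a, b = calibrate(y=1553.0, k=5668.0, e=1.02, n=21.63, labor=1.0, pn=1.53)
    print(f"a = {a:.3e}, b = {b:.3e}")
    # Consistency checks: output is reproduced and dY/dN matches pn.
    h = 1e-4
    slope = (production(5668.0, 1.02, 21.63 + h, 1.0, a, b)
             - production(5668.0, 1.02, 21.63 - h, 1.0, a, b)) / (2.0 * h)
    print(f"Y = {production(5668.0, 1.02, 21.63, 1.0, a, b):.1f}, dY/dN = {slope:.3f}")
```

The finite-difference check at the end is a convenient way to confirm that the calibrated coefficients satisfy the first order condition for whichever data set is used.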


29.4 The Dual Problems 


The test problems introduced in the preceding sections have all been handled 
by applying nested decomposition on their dual. The methodological approach 
has been presented at length in ([8]) and ([9]) and we shall not return to it 
here. This section will state the extensive form of the deterministic equivalents 
of the test problems, present their duals and discuss some of their features. 

Extensive forms are conveniently presented by referring to event trees. Let
$TR$ be this tree and $i$ be one of its nodes. $S(i)$ is the set of successors of $i$ and
$P(i)$ its predecessor in $TR$. $D(i)$ is the depth of node $i$ and $\pi_i$ its probability;
the root of the tree is noted 1 and the set of the terminal nodes $L$.
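
The event tree notation can be made concrete with a small data structure. The sketch below builds a binary tree with equally likely branches; it is not the authors' implementation. Several choices are assumptions made for illustration: nodes are numbered depth-first starting from 1, the root is given depth 1, and the two branch values are read as price growths of 0% and 4%.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    index: int                  # node label i
    depth: int                  # D(i); the root is given depth 1 here
    prob: float                 # pi_i, probability of reaching node i
    growth: float               # price growth on the branch leading to i
    parent: Optional[int]       # P(i); None for the root
    children: List[int] = field(default_factory=list)   # S(i)


def build_tree(periods: int, growths=(0.0, 0.04)) -> Dict[int, Node]:
    """Build an event tree with one equally likely branch per growth value."""
    nodes: Dict[int, Node] = {}

    def add(depth, prob, growth, parent):
        index = len(nodes) + 1          # depth-first numbering (assumption)
        nodes[index] = Node(index, depth, prob, growth, parent)
        if depth < periods:
            for g in growths:
                child = add(depth + 1, prob / len(growths), g, index)
                nodes[index].children.append(child)
        return index

    add(1, 1.0, 0.0, None)              # the root carries no price growth
    return nodes


def terminal_nodes(nodes: Dict[int, Node]) -> List[int]:
    """Return the set L of nodes without successors."""
    return [i for i, node in nodes.items() if not node.children]


if __name__ == "__main__":
    tree = build_tree(periods=5)
    leaves = terminal_nodes(tree)
    print(len(tree), len(leaves), sum(tree[i].prob for i in leaves))
```

With this structure, sums over $i \in TR$, $i \in L$ or $j \in S(i)$ translate directly into loops over `nodes`, `terminal_nodes(nodes)` and `nodes[i].children`.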

The writing of the dual problems involves a profit function $Y^*$ defined as:

\[
Y^*(\Pi_K, \Pi_E, \Pi_N) = \max \; -\Pi_K K - \Pi_E E - \Pi_N N + Y(K, E, N)
\]

where $\Pi_K$, $\Pi_E$ and $\Pi_N$ are positive scalars; $Y$ is the production function where
the value of $L$ has been fixed. Because $Y$ is differentiable, the profit function
can be computed by finding the solution vectors:

\begin{align*}
K^* &= K(\Pi_K, \Pi_E, \Pi_N) \\
E^* &= E(\Pi_K, \Pi_E, \Pi_N) \\
N^* &= N(\Pi_K, \Pi_E, \Pi_N)
\end{align*}

of the inverted demand system

\begin{align*}
\frac{\partial Y}{\partial K} &= \Pi_K \\
\frac{\partial Y}{\partial E} &= \Pi_E \\
\frac{\partial Y}{\partial N} &= \Pi_N
\end{align*}

and substituting through.

Although $K^*$, $E^*$ and $N^*$ can be computed analytically, the derivation of
an explicit expression of the profit function appears cumbersome. We shall thus
always evaluate $Y^*(\Pi_K, \Pi_E, \Pi_N)$ by numerical substitution and will note it as:

\[
Y^*(\Pi_K, \Pi_E, \Pi_N) = -\Pi_K K^* - \Pi_E E^* - \Pi_N N^* + Y(K^*, E^*, N^*)
\]

in the rest of the paper.
Because of the conjugacy condition, we also have:

\begin{align*}
\frac{\partial Y^*}{\partial \Pi_K} &= -K^* \\
\frac{\partial Y^*}{\partial \Pi_E} &= -E^* \\
\frac{\partial Y^*}{\partial \Pi_N} &= -N^*
\end{align*}

which gives the gradient of the profit function. We can now proceed to state
the problems.


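Basic Putty-Clay Model

Primal Problem

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i \in TR,\, i \notin L} \pi_i \rho^{D(i)} \log C_i + \sum_{i \in L} \frac{\pi_i \rho^{D(i)}}{1-\rho} \log C_i \\
\text{subject to} \quad
& C_i + I_i + pe_i E_i + pn_i N_i - Y_i \le 0 && u_{i1} && i \in TR \\
& Y_i - \lambda Y_{P(i)} - YN_i \le 0 && u_{i2} && i \in TR,\ i \ne 1 \\
& KN_i - K_i + \lambda K_{P(i)} \le 0 && u_{i3} && i \in TR,\ i \ne 1 \\
& EN_i - E_i + \lambda E_{P(i)} \le 0 && u_{i4} && i \in TR,\ i \ne 1 \\
& NN_i - N_i + \lambda N_{P(i)} \le 0 && u_{i5} && i \in TR,\ i \ne 1 \\
& K_i - \lambda K_{P(i)} - \alpha I_i - \beta I_{P(i)} \le 0 && u_{i6} && i \in TR,\ i \ne 1 \\
& -I_i + (1 + g - \lambda^{A}) K_i \le 0 && u_{i7} && i \in L
\end{aligned}
\]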
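
Dual Problem

\[
\begin{aligned}
\text{minimize} \quad
& \sum_{i \in TR,\, i \notin L} -\pi_i \rho^{D(i)} \left[ 1 + \log \frac{u_{i1}}{\pi_i \rho^{D(i)}} \right]
+ \sum_{i \in L} -\frac{\pi_i \rho^{D(i)}}{1-\rho} \left[ 1 + \log \frac{(1-\rho)\, u_{i1}}{\pi_i \rho^{D(i)}} \right] \\
& - \left[ I_1 + pe_1 E_1 + pn_1 N_1 - Y_1 \right] u_{11}
+ (\beta I_1 + \lambda K_1) \Big( \sum_{i \in S(1)} u_{i6} \Big)
- \lambda E_1 \Big( \sum_{i \in S(1)} u_{i4} \Big) \\
& - \lambda N_1 \Big( \sum_{i \in S(1)} u_{i5} \Big)
- \lambda K_1 \Big( \sum_{i \in S(1)} u_{i3} \Big)
+ \lambda Y_1 \Big( \sum_{i \in S(1)} u_{i2} \Big) \\
& + \sum_{i \in TR,\, i \ne 1} \left[ u_{i2} YN_i^* - u_{i3} KN_i^* - u_{i4} EN_i^* - u_{i5} NN_i^* \right] \\
\text{subject to} \quad
& u_{i1} - \alpha u_{i6} - \beta \sum_{j \in S(i)} u_{j6} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& pe_i u_{i1} - u_{i4} + \lambda \sum_{j \in S(i)} u_{j4} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& pn_i u_{i1} - u_{i5} + \lambda \sum_{j \in S(i)} u_{j5} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& -u_{i1} + u_{i2} - \lambda \sum_{j \in S(i)} u_{j2} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& u_{i6} - \lambda \sum_{j \in S(i)} u_{j6} - u_{i3} + \lambda \sum_{j \in S(i)} u_{j3} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& u_{i1} - \alpha u_{i6} - u_{i7} \ge 0 && i \in L \\
& pe_i u_{i1} - u_{i4} \ge 0 && i \in L \\
& pn_i u_{i1} - u_{i5} \ge 0 && i \in L \\
& -u_{i1} + u_{i2} \ge 0 && i \in L \\
& u_{i6} - u_{i3} + (1 + g - \lambda^{A}) u_{i7} \ge 0 && i \in L
\end{aligned}
\]

where $KN_i^*$, $EN_i^*$ and $NN_i^*$ are computed using the prices $u_{i3}/u_{i2}$, $u_{i4}/u_{i2}$ and $u_{i5}/u_{i2}$
respectively.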
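
As a brief aside, the logarithmic terms in the dual objectives can be traced to the conjugate of the logarithmic utility. The derivation below is a sketch; the shorthand $w$ for the weight of $\log C$ in the primal objective (for example $w = \pi_i \rho^{D(i)}$ at a nonterminal node) and $u$ for the multiplier of the allocation constraint are introduced here and do not appear in the original.

\[
\max_{C > 0} \; \{ w \log C - u C \}:
\qquad \frac{w}{C} - u = 0 \;\Rightarrow\; C^* = \frac{w}{u},
\qquad w \log \frac{w}{u} - w = -w \left[ 1 + \log \frac{u}{w} \right] .
\]

At a terminal node the same argument applies with $w = \pi_i \rho^{D(i)}/(1-\rho)$, which accounts for the factor $(1-\rho)$ inside the logarithm.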
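
Basic Putty-Putty Model

Primal Problem

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i \in TR,\, i \notin L} \pi_i \rho^{D(i)} \log C_i + \sum_{i \in L} \frac{\pi_i \rho^{D(i)}}{1-\rho} \log C_i \\
\text{subject to} \quad
& C_i + I_i + pe_i E_i + pn_i N_i - Y_i \le 0 && u_{i1} && i \in TR \\
& K_i - \lambda K_{P(i)} - \alpha I_i - \beta I_{P(i)} \le 0 && u_{i2} && i \in TR,\ i \ne 1 \\
& -I_i + (1 + g - \lambda^{A}) K_i \le 0 && u_{i3} && i \in L
\end{aligned}
\]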

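Dual Problem

\[
\begin{aligned}
\text{minimize} \quad
& \sum_{i \in TR,\, i \notin L} \pi_i \rho^{D(i)} \left[ -1 + \log \frac{\pi_i \rho^{D(i)}}{u_{i1}} \right]
+ \sum_{i \in L} \frac{\pi_i \rho^{D(i)}}{1-\rho} \left[ -1 + \log \frac{\pi_i \rho^{D(i)}}{(1-\rho)\, u_{i1}} \right] \\
& - \left[ I_1 + pe_1 E_1 + pn_1 N_1 - Y_1 \right] u_{11}
+ (\beta I_1 + \lambda K_1) \sum_{i \in S(1)} u_{i2} \\
& + \sum_{i \in TR,\, i \ne 1} u_{i1} \left[ Y_i^* - \frac{d_i}{u_{i1}} K_i^* - pe_i E_i^* - pn_i N_i^* \right] \\
\text{subject to} \quad
& u_{i1} - \alpha u_{i2} - \beta \sum_{j \in S(i)} u_{j2} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& d_i - u_{i2} + \lambda \sum_{j \in S(i)} u_{j2} = 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& u_{i1} - \alpha u_{i2} - u_{i3} \ge 0 && i \in L \\
& d_i - u_{i2} - (1 + g - \lambda^{A}) u_{i3} = 0 && i \in L
\end{aligned}
\]

where $K_i^*$, $E_i^*$ and $N_i^*$ are computed using the prices $d_i/u_{i1}$, $pe_i$ and $pn_i$.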
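
Electricity Putty-Clay Model

Primal Problem

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i \in TR,\, i \notin L} \pi_i \rho^{D(i)} \log C_i + \sum_{i \in L} \frac{\pi_i \rho^{D(i)}}{1-\rho} \log C_i \\
\text{subject to} \quad
& C_i + I_i + ce_i IE_i + pe_i E_i + pn_i N_i - Y_i \le 0 && u_{i1} && i \in TR \\
& Y_i - \lambda Y_{P(i)} - YN_i \le 0 && u_{i2} && i \in TR,\ i \ne 1 \\
& KN_i - K_i + \lambda K_{P(i)} \le 0 && u_{i3} && i \in TR,\ i \ne 1 \\
& EN_i - E_i + \lambda E_{P(i)} \le 0 && u_{i4} && i \in TR,\ i \ne 1 \\
& NN_i - N_i + \lambda N_{P(i)} \le 0 && u_{i5} && i \in TR,\ i \ne 1 \\
& K_i - \lambda K_{P(i)} - \alpha I_i - \beta I_{P(i)} \le 0 && u_{i6} && i \in TR,\ i \ne 1 \\
& E_i - d_E KE_i \le 0 && u_{i7} && i \in TR,\ i \ne 1 \\
& KE_i - \lambda_E KE_{P(i)} - \alpha_E IE_i - \beta_E IE_{P(i)} \le 0 && u_{i8} && i \in TR,\ i \ne 1 \\
& -I_i + (1 + g - \lambda^{A}) K_i \le 0 && u_{i9} && i \in L \\
& -IE_i + (1 + g - \lambda_E^{A}) KE_i \le 0 && u_{i10} && i \in L
\end{aligned}
\]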
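
Dual Problem

\[
\begin{aligned}
\text{minimize} \quad
& \sum_{i \in TR,\, i \notin L} -\rho^{D(i)} \pi_i \left[ 1 + \log \frac{u_{i1}}{\rho^{D(i)} \pi_i} \right]
+ \sum_{i \in L} -\frac{\rho^{D(i)} \pi_i}{1-\rho} \left[ 1 + \log \frac{(1-\rho)\, u_{i1}}{\rho^{D(i)} \pi_i} \right] \\
& - \lambda N_1 \Big( \sum_{i \in S(1)} u_{i5} \Big)
- \lambda E_1 \Big( \sum_{i \in S(1)} u_{i4} \Big)
- \lambda K_1 \Big( \sum_{i \in S(1)} u_{i3} \Big) \\
& + \lambda Y_1 \Big( \sum_{i \in S(1)} u_{i2} \Big)
+ (\beta_E IE_1 + \lambda_E KE_1) \Big( \sum_{i \in S(1)} u_{i8} \Big) \\
& - \left[ I_1 + ce_1 IE_1 + pe_1 E_1 + pn_1 N_1 - Y_1 \right] u_{11}
+ (\beta I_1 + \lambda K_1) \Big( \sum_{i \in S(1)} u_{i6} \Big) \\
& + \sum_{i \in TR,\, i \ne 1} \left[ u_{i2} YN_i^* - u_{i3} KN_i^* - u_{i4} EN_i^* - u_{i5} NN_i^* \right] \\
\text{subject to} \quad
& ce_i u_{i1} - \alpha_E u_{i8} - \beta_E \sum_{j \in S(i)} u_{j8} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& u_{i8} - d_E u_{i7} - \lambda_E \sum_{j \in S(i)} u_{j8} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& u_{i1} - \alpha u_{i6} - \beta \sum_{j \in S(i)} u_{j6} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& pe_i u_{i1} - u_{i4} + \lambda \sum_{j \in S(i)} u_{j4} + u_{i7} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& pn_i u_{i1} - u_{i5} + \lambda \sum_{j \in S(i)} u_{j5} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& -u_{i1} + u_{i2} - \lambda \sum_{j \in S(i)} u_{j2} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& -u_{i3} + \lambda \sum_{j \in S(i)} u_{j3} + u_{i6} - \lambda \sum_{j \in S(i)} u_{j6} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& ce_i u_{i1} - \alpha_E u_{i8} - u_{i10} \ge 0 && i \in L \\
& u_{i8} - d_E u_{i7} + (1 + g - \lambda_E^{A}) u_{i10} \ge 0 && i \in L \\
& u_{i1} - \alpha u_{i6} - u_{i9} \ge 0 && i \in L \\
& pe_i u_{i1} - u_{i4} + u_{i7} \ge 0 && i \in L \\
& pn_i u_{i1} - u_{i5} \ge 0 && i \in L \\
& -u_{i1} + u_{i2} \ge 0 && i \in L \\
& -u_{i3} + u_{i6} + (1 + g - \lambda^{A}) u_{i9} \ge 0 && i \in L
\end{aligned}
\]

where $KN_i^*$, $EN_i^*$ and $NN_i^*$ are computed using the prices $u_{i3}/u_{i2}$, $u_{i4}/u_{i2}$ and $u_{i5}/u_{i2}$
respectively.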

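Electricity Putty-Putty Model

Primal Problem

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i \in TR,\, i \notin L} \pi_i \rho^{D(i)} \log C_i + \sum_{i \in L} \frac{\pi_i \rho^{D(i)}}{1-\rho} \log C_i \\
\text{subject to} \quad
& C_i + I_i + ce_i IE_i + pe_i E_i + pn_i N_i - Y_i \le 0 && u_{i1} && i \in TR \\
& K_i - \lambda K_{P(i)} - \alpha I_i - \beta I_{P(i)} \le 0 && u_{i2} && i \in TR,\ i \ne 1 \\
& E_i - d_E KE_i \le 0 && u_{i3} && i \in TR,\ i \ne 1 \\
& KE_i - \lambda_E KE_{P(i)} - \alpha_E IE_i - \beta_E IE_{P(i)} \le 0 && u_{i4} && i \in TR,\ i \ne 1 \\
& -I_i + (1 + g - \lambda^{A}) K_i \le 0 && u_{i5} && i \in L \\
& -IE_i + (1 + g - \lambda_E^{A}) KE_i \le 0 && u_{i6} && i \in L
\end{aligned}
\]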




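Dual Problem

\[
\begin{aligned}
\text{minimize} \quad
& \sum_{i \in TR,\, i \notin L} \pi_i \rho^{D(i)} \left[ -1 + \log \frac{\pi_i \rho^{D(i)}}{u_{i1}} \right]
+ \sum_{i \in L} \frac{\pi_i \rho^{D(i)}}{1-\rho} \left[ -1 + \log \frac{\pi_i \rho^{D(i)}}{(1-\rho)\, u_{i1}} \right] \\
& - \left[ I_1 + ce_1 IE_1 + pe_1 E_1 + pn_1 N_1 - Y_1 \right] u_{11}
+ (\beta I_1 + \lambda K_1) \Big( \sum_{j \in S(1)} u_{j2} \Big) \\
& + (\beta_E IE_1 + \lambda_E KE_1) \Big( \sum_{j \in S(1)} u_{j4} \Big) \\
& + \sum_{i \in TR,\, i \ne 1} u_{i1} \left[ Y_i^* - \frac{d_i}{u_{i1}} K_i^* - \Big( pe_i + \frac{u_{i3}}{u_{i1}} \Big) E_i^* - pn_i N_i^* \right] \\
\text{subject to} \quad
& u_{i1} - \alpha u_{i2} - \beta \sum_{j \in S(i)} u_{j2} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& d_i - u_{i2} + \lambda \sum_{j \in S(i)} u_{j2} = 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& -d_E u_{i3} + u_{i4} - \lambda_E \sum_{j \in S(i)} u_{j4} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& ce_i u_{i1} - \alpha_E u_{i4} - \beta_E \sum_{j \in S(i)} u_{j4} \ge 0 && i \in TR,\ i \ne 1,\ i \notin L \\
& u_{i1} - \alpha u_{i2} - u_{i5} \ge 0 && i \in L \\
& d_i - u_{i2} - (1 + g - \lambda^{A}) u_{i5} = 0 && i \in L \\
& -d_E u_{i3} + u_{i4} + (1 + g - \lambda_E^{A}) u_{i6} \ge 0 && i \in L \\
& ce_i u_{i1} - \alpha_E u_{i4} - u_{i6} \ge 0 && i \in L
\end{aligned}
\]

where $K_i^*$, $E_i^*$ and $N_i^*$ are computed using the prices $d_i/u_{i1}$, $(pe_i + u_{i3}/u_{i1})$ and $pn_i$,
respectively.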

29.5 Numerical Experiments


This section presents some of the results obtained by applying nested
decomposition on the test problems. Recall here that we deal with four models that
we run on horizons of 5, 7 and 9 periods. The size of the resulting problems
is given in Table 29.5.


Nested decomposition proceeds by a sequence of cycles and can be stopped
when the relative error between the current objective function value and the
lower bound generated by the algorithm is sufficiently small or when no
proposition is generated at some cycle. Table 29.6 reports the overall convergence
properties of the method.
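
The stopping rule just described can be written as a short test. The sketch below is an illustration only: the definition of the relative error (gap divided by the magnitude of the current objective value), the tolerance and the function names are assumptions, since the text does not specify them.

```python
def relative_error(objective, lower_bound):
    """Relative gap between the current objective value and the lower bound."""
    return abs(objective - lower_bound) / max(abs(objective), 1e-12)


def should_stop(objective, lower_bound, n_proposals, tol=1e-6):
    """Stop when the gap is small enough or when a cycle yields no proposal."""
    if n_proposals == 0:
        return True
    return relative_error(objective, lower_bound) <= tol


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the reported runs.
    print(should_stop(objective=22.0251, lower_bound=22.0250, n_proposals=3, tol=1e-5))
    print(should_stop(objective=22.03, lower_bound=22.02, n_proposals=3))
```

In practice such a test would be evaluated once per cycle, after the bound and the objective value have both been updated.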


Although these results certainly appear reasonable if one considers the 
size of the problems, one should keep in mind that they do not give a com- 
plete overview of the method. This is illustrated by considering the evolution, 
through the algorithm, of the objective function and of some of the variables 
of the problem BPP with seven periods (see Table 29.7). It can be seen that 
while the objective function converges rather quickly (cycle 10), all the cycles 
are necessary in order to achieve convergence of the primal variables. 


It is clearly impossible to list the complete optimal solution of those different
problems. In order to provide some references for future numerical experiments,
we report at the end of this section the optimal solution of some of the
variables for the first four periods. The numbering of the nodes is as given in
Figure 29.1.


Table 29.5 Size of the Test Problems 


Primal Problem Dual Problem 


onstraints\Constraints [Variables (Constraints [Variables |Variables 
47 30 151 60 46 61 
126 631 252 190 253 
510 2551 1020 766 1021 
211 120 92 91 
883 504 380 379 
3571 2040 1532 1531 
50 
30 


271 1 46 151 
1135 6 190 631 
4591 2550 766 2551 

331 210 122 151 
1387 882 506 631 
5611 3570 2042 2551 

Table 29.6 Stopping Condition





5 periods 


deterministic 





5 periods 

stochastic 

7 periods 
deterministic 




















N.B. The first number refers to the number of cycles, the second to the 
relative error. 


Table 29.7 Convergence of some of the Variables. Problem BPP 7 Periods.


Objective Investment in Investment in
Function a Node of a Node of
Value Period 2 Period 4 


22. 22, 
034908 021512 
029445 024762 
027981 024917 


026600 024993 
025326 025033 
025193 025049 
025113 025071 
025096 025074 
025082 025074 
025078 025075 

Table 29.8 Optimal Objective Value


Problems Objective Value 


BPP 5 Periods 21.874876 
BPP 7 Periods 22.025077 
BPP 9 Periods 22.100000 


BPC 5 Periods 21.852174 
BPC 7 Periods 22.004176 
BPC 9 Periods 22.182961 








EPP 5 Periods 21.894914 
EPP 7 Periods 22.044848 
EPP 9 Periods 22.119711 


EPC 5 Periods 21.871362 
EPC 7 Periods 22.023133 
EPC 9 Periods 22.102913 





Table 29.9 BPP 5 Periods: Optimal Solution


(10^9 $75) (10^9 $75) (quads) (10^12 KWH) (10^9 $75) (10^9 $75)


1.059 340.0 1217.0 

Table 29.10 BPP 7 Periods: Optimal Solution





Y 
(10°$75) 
1658.0 
1859.9 
2073.5 
2360.3 
2351.7 
2066.5 
2349.8 
2340.8 
1853.5 
2063.9 
2347.4 
2338.8 
2056.4 
2336.7 
2326.9 














Table 29.11 BPP 9 Periods: Optimal Solution 








5982.0 
6151.4 
6345.1 

699.3 
6975.4 
6321.9 
6960.5 
6912.0 
6153.9 
6322.7 
6912.4 
6919.4 
6297.7 
6873.9 
6844.7 

Table 29.12 BPC 5 Periods: Optimal Solution


E            I           C
(10^12 KWH)  (10^9 $75)  (10^9 $75)








Table 29.13 BPC 7 Periods: Optimal Solution


E            I           C
(10^12 KWH)  (10^9 $75)  (10^9 $75)

Table 29.14 BPC 9 Periods: Optimal Solution


Y           E            I           C
(10^9 $75)  (10^12 KWH)  (10^9 $75)  (10^9 $75)

1658.0      1.059        340.0       1217.0
1904.0      1.318
2169.1      1.584
2498.2      1.896
2493.8      1.938
2165.6      1.620
2490.2      1.966
2485.4      2.010
1901.0      1.350
2162.5      1.645
2487.7      1.987
2483.0      2.031
2158.6      1.683
2479.0      2.061
2473.8      2.108








Table 29.15 EPP 5 Periods: Optimal Solution 

Table 29.16 EPP 7 Periods: Optimal Solution


Nodes  K           Y           N        E            I           C           KE         IE
       (10^9 $75)  (10^9 $75)  (quads)  (10^12 KWH)  (10^9 $75)  (10^9 $75)  (10^6 KW)  (10^6 KW)

1      5982.0      1658.0      21.28    1.059        340.0       1217.0      201.48     25.59
2      6174.8      1861.6      15.74    1.567        205.7       1575.1      299.04     16.28
3      6369.0      2076.5      17.54    1.750        307.6       1686.8      332.95     15.09
4      6986.1      2360.5      20.28    1.892        392.6       1867.1      356.46     21.05
5      6953.2      2351.1      18.23    1.893        381.6       1856.2      356.48     27.03
6      6339.6      2066.7      15.78    1.752        297.8       1671.3      332.95     21.09
7      6935.5      2351.9      17.83    2.031        390.2       1852.3      386.51     19.91
8      6898.8      2338.6      16.03    2.032        378.0       1841.0      386.51     26.52
9      6149.0      1853.0      14.15    1.567        197.1       1566.9      298.04     21.26
10     6318.3      2066.0      15.42    1.882        303.4       1668.0      357.98     16.69
11     6927.3      2350.0      17.84    2.026        389.6       1851.7      385.62     20.43
12     6893.2      2377.9      16.03    2.027        378.2       1839.9      385.62     26.80
13     6287.3      2055.6      13.86    1.882        293.1       1657.2      357.98     23.84
14     6876.0      2339.5      15.60    2.214        387.8       1835.5      421.35     20.13
15     6834.4      2325.0      14.01    2.215        373.9       1923.5      421.35     27.61





Table 29.17 EPP 9 Periods: Optimal Solution


Nodes  K           Y           N        E            I           C           KE         IE
       (10^9 $75)  (10^9 $75)  (quads)  (10^12 KWH)  (10^9 $75)  (10^9 $75)  (10^6 KW)  (10^6 KW)

1      5982.0      1658.0      21.28    1.059        340.0       1217.0      201.48     25.59
2      6159.2      1860.0      15.73    1.552        200.6       1574.4      298.05     23.24
3      6376.1      2079.3      17.04    1.933        317.6       1679.0      367.76     6.57
4      6999.4      2362.8      20.59    1.804        388.4       1870.2      343.23     22.90
5      6967.2      2351.3      18.51    1.804        377.3       1858.8      343.23     29.08
6      6343.9      2069.4      15.31    1.933        306.9       1668.9      367.76     14.94
7      6956.0      2352.9      17.92    2.008        389.8       1852.7      385.11     22.87
8      6915.4      2340.3      16.06    2.024        376.3       1842.7      385.11     29.06
9      6165.2      1854.7      14.16    1.570        202.6       1566.0      298.05     16.54
10     6323.7      2065.1      15.74    1.757        297.2       1673.2      334.26     16.54
11     6900.5      2345.6      18.10    1.918        383.3       1848.5      364.84     28.46
12     6879.1      2334.8      16.28    1.918        376.2       1838.6      364.84     26.05
13     6297.1      2054.9      14.15    1.757        288.3       1661.2      334.26     23.55
14     6864.6      2335.8      15.83    2.101        384.5       1833.2      399.90     24.60
15     6825.3      2322.4      14.22    2.102        371.4       1918.3      399.90     35.76

Table 29.18 EPC 5 Periods: Optimal Solution


E I Cc 
woth) | 05 


Te ; 
1897.2 
2146.6 
2438.3 
2433.6 
2142.6 





2430.5 


2425.5 

1894.5 

2140.8 : : . ; 326.05 
2429.8 ; 4 . F 381.53 
2424.7 . : f . 381.53 
2136.5 . : : ‘ 326.05 
2421.3 . A . A 388.08 
2415.8 : 5 . “ 388.08 








Table 29.19 EPC 7 Periods: Optimal Solution


Nodes  K           Y           N        E            I           C           KE         IE
       (10^9 $75)  (10^9 $75)  (quads)  (10^12 KWH)  (10^9 $75)  (10^9 $75)  (10^6 KW)  (10^6 KW)

Table 29.20 EPC 9 Periods: Optimal Solution


I C KE E 
so Kowa 875 | oo | ot Kw (no 


CHAPTER 30


STOCHASTIC PROGRAMMING PROBLEMS: 
EXAMPLES FROM THE LITERATURE 


A.J. King 


Introduction 


This is a small collection of problems which have appeared in the stochastic 
programming literature over the past two and a half decades. The intention 
guiding the choices was to provide a number of test problems with solutions 
for researchers who are developing and testing algorithms. Of course anyone 
can jot down a stochastic linear program. This collection seeks to provide the 
researcher with a variety of formulations, some classical and some new and as 
yet unsolved, as templates from which many test problems can be generated. 

The problems range from classical chance-constrained and simple recourse 
models to dynamic models with both chance-constrained and general recourse 
examples. There are some unfortunate omissions. Chief among these are 
Kallberg, White and Ziemba’s financial planning model, the STABIL model of 
Prékopa et al., Somlyódy and Wets’ Lake Balaton model, and Kall and Keller’s 
collection of general recourse problems. For these we give references together 
with a brief description and classification below; these models were excluded 
for lack of published data and/or lack of space for presentation herein. 

Finally a disclaimer: None of the solutions has been checked for accuracy. 
There have been many opportunities in the process of collecting and publishing 
these problems for errors to creep in and replicate. Beware! 

We would be grateful for any corrections to these problems, and for any 
additions that may be proposed by the readers for future editions of this col- 
lection. 

In addition, some of these problems will be made available in the special 
format for recourse problems (as described in this volume) on a computer tape 
to be distributed in early 1985 by the International Institute for Applied 
Systems Analysis (IIASA). 
AIRCRAFT ALLOCATION PROBLEM 


Reference 


G. Dantzig: Linear Programming and Extensions, Princeton University 
Press, 1963, pp. 572-597. 


This is the classic example of a stochastic program with simple recourse. 

An airline wishes to allocate airplanes of various types among its routes to 
satisfy an uncertain passenger demand, in such a way as to minimize operating 
costs plus the lost revenue from passengers turned away. 

This problem will be available on the stochastic programming computer 
tape distributed by IIASA. 


Stochastic program with simple recourse. 


Choose $x_j$ $(j = 1,\dots,17)$ to minimize 

\[ \sum_{j=1}^{17} c_j x_j + E\Big\{ \sum_{k=1}^{5} q_k v_k^- \Big\} \]

subject to 

\[
\begin{aligned}
x_1 + x_2 + x_3 + x_4 + x_5 &\le b_1 \\
x_6 + x_7 + x_8 + x_9 &\le b_2 \\
x_{10} + x_{11} + x_{12} &\le b_3 \\
x_{13} + x_{14} + x_{15} + x_{16} + x_{17} &\le b_4 \\
x_j &\ge 0, \quad j = 1,\dots,17 \\
v_k^+ \ge 0, \quad v_k^- &\ge 0, \quad k = 1,\dots,5 \\
v_1^+ - v_1^- &= t_1 x_1 + t_{13} x_{13} - h_1 \\
v_2^+ - v_2^- &= t_2 x_2 + t_6 x_6 + t_{10} x_{10} + t_{14} x_{14} - h_2 \\
v_3^+ - v_3^- &= t_3 x_3 + t_7 x_7 + t_{15} x_{15} - h_3 \\
v_4^+ - v_4^- &= t_4 x_4 + t_8 x_8 + t_{11} x_{11} + t_{16} x_{16} - h_4 \\
v_5^+ - v_5^- &= t_5 x_5 + t_9 x_9 + t_{12} x_{12} + t_{17} x_{17} - h_5
\end{aligned}
\]

$x_1,\dots,x_5$: type 1 aircraft assigned to routes 1,...,5 
$x_6,\dots,x_9$: type 2 aircraft assigned to routes 2,...,5 
$x_{10}, x_{11}, x_{12}$: type 3 aircraft assigned to routes 2, 4, 5 
$x_{13},\dots,x_{17}$: type 4 aircraft assigned to routes 1,...,5 
$b_i$: number of aircraft available of type $i = 1,\dots,4$ 
$c_j$: cost of operating aircraft/route $j = 1,\dots,17$ 
$q_k$: revenue lost per passenger turned away on route $k = 1,\dots,5$ 
$v_k^+$: empty seats on route $k$ 
$v_k^-$: passengers turned away on route $k$ 
$t_j$: passenger capacity on aircraft/route $j$ 
$h_k$: passenger demand for route $k$. 
Data: 

c = [18, 21, 18, 16, 10, 15, 16, 14, 9, 10, 9, 6, 17, 16, 17, 15, 10] 
q = [13, 13, 7, 7, 1] 
b = [10, 19, 25, 15] 
t = [16, 15, 28, 23, 81, 10, 14, 15, 57, 5, 7, 29, 9, 11, 22, 17, 55] 


The $h_k$ are discretely distributed as follows: 
$h_1$ ~ [200, 220, 250, 270, 300] w.p. (0.2, 0.05, 0.35, 0.2, 0.2) 
$h_2$ ~ [50, 150] w.p. (0.3, 0.7) 
$h_3$ ~ [140, 160, 180, 200, 220] w.p. (0.1, 0.2, 0.4, 0.2, 0.1) 
$h_4$ ~ [10, 50, 80, 100, 340] w.p. (0.2, 0.2, 0.3, 0.2, 0.1) 
$h_5$ ~ [580, 600, 620] w.p. (0.1, 0.8, 0.1) 
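
Because the demands are discrete, the objective of this problem can be evaluated exactly for any given allocation. The following is a minimal Python sketch of that evaluation, not part of the original text. The names `route`, `expected_cost` and `x_reported` are illustrative, and `x_reported` is the allocation from the Solution section that follows, as read from the printed table.

```python
# Data transcribed from the problem statement (list index 0 holds x_1).
c = [18, 21, 18, 16, 10, 15, 16, 14, 9, 10, 9, 6, 17, 16, 17, 15, 10]
t = [16, 15, 28, 23, 81, 10, 14, 15, 57, 5, 7, 29, 9, 11, 22, 17, 55]
q = [13, 13, 7, 7, 1]

# Route served by each variable x_1..x_17:
# type 1 -> routes 1-5, type 2 -> routes 2-5, type 3 -> routes 2, 4, 5, type 4 -> routes 1-5.
route = [1, 2, 3, 4, 5, 2, 3, 4, 5, 2, 4, 5, 1, 2, 3, 4, 5]

# Discrete demand distributions h_k: (values, probabilities).
demand = {
    1: ([200, 220, 250, 270, 300], [0.2, 0.05, 0.35, 0.2, 0.2]),
    2: ([50, 150], [0.3, 0.7]),
    3: ([140, 160, 180, 200, 220], [0.1, 0.2, 0.4, 0.2, 0.1]),
    4: ([10, 50, 80, 100, 340], [0.2, 0.2, 0.3, 0.2, 0.1]),
    5: ([580, 600, 620], [0.1, 0.8, 0.1]),
}


def expected_cost(x):
    """Operating cost plus expected revenue lost from passengers turned away."""
    operating = sum(cj * xj for cj, xj in zip(c, x))
    seats = {k: 0.0 for k in demand}
    for tj, xj, k in zip(t, x, route):
        seats[k] += tj * xj
    lost = 0.0
    for k, (values, probs) in demand.items():
        # E{v_k^-}: expected number of passengers exceeding the seats offered.
        turned_away = sum(p * max(h - seats[k], 0.0) for h, p in zip(values, probs))
        lost += q[k - 1] * turned_away
    return operating + lost


# Allocation as read from the printed solution table (an assumption where the scan is unclear).
x_reported = [10, 0, 0, 0, 0, 12.8, 0.9, 5.3, 0, 4.3, 0, 20.7, 7.4, 0.0, 7.6, 0, 0]
print(round(expected_cost(x_reported), 2))
```

The same function can be used to compare alternative allocations, provided they respect the aircraft availability limits $b_i$, which this sketch does not enforce.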


Solution: 
Calculated to one decimal place accuracy 


Aircraft Type    1             2              3               4 
Route 
1                x_1 = 10      *              *               x_13 = 7.4 
2                x_2 = 0       x_6 = 12.8     x_10 = 4.3      x_14 = 0.0 
3                x_3 = 0       x_7 = 0.9      *               x_15 = 7.6 
4                x_4 = 0       x_8 = 5.3      x_11 = 0        x_16 = 0 
5                x_5 = 0       x_9 = 0        x_12 = 20.7     x_17 = 0 


CLEEF'S TEST PROBLEM 


Reference 


H.J. Cleef, “A solution procedure for the two-stage stochastic program with 
simple recourse.” Zeitschrift für O.R., 25 (1981) 1-13. 


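Stochastic program with simple recourse 


Choose $x_j$ $(j = 1,\dots,16)$ to minimize 

\[ \sum_{j=1}^{16} c_j x_j + E\Big\{ \sum_{k=1}^{6} \big( q_k^+ v_k^+ + q_k^- v_k^- \big) \Big\} \]

subject to: 

\[ \sum_{j=1}^{16} a_{ij} x_j = b_i, \quad i = 1, 2, 3 \]

\[ \sum_{j=1}^{16} t_{kj} x_j + v_k^+ - v_k^- = h_k, \quad k = 1,\dots,6 \]

\[ x_j \ge 0, \quad v_k^+ \ge 0, \quad v_k^- \ge 0 \]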
Data 
h is discretely distributed as follows: 
$h_1$ ~ [3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15] w.p. (.05, .08, .1, .1, .1, .1, .2, .1, .05, .05, 
.04, .03) 
$h_2$ ~ [5, 6, 6.5, 7.0, 7.5] w.p. (.08, .12, .4, .2, .2) 
$h_3$ ~ [0, 1.0, 2.0, 2.5, 3.0, 3.25, 3.5, 3.75, 4.0, 4.25, 4.5, 4.75] w.p. (.06, .06, .06, .06, 
.06, .06, .2, .18, .06, .08, .06, .06) 
$h_4$ ~ [-2.0, -1.0, 1.0, 2.0] w.p. (.13, .5, .25, .12) 
$h_5$ ~ [0.0, 1.0, 2.0, 2.5, 3.0, 6.0, 7.0, 10.0, 12.0] w.p. (.08, .12, .1, .1, .2, .1, .1, .12, .08) 
$h_6$ ~ [8.0, 24.0] w.p. (.5, .5) 


Solution 
x_1 = 0                x_7 = 0.108757541     x_13 = 0.955190541 
x_2 = 0                x_8 = 0               x_14 = 0.301132174 
x_3 = 0.521568144      x_9 = 0               x_15 = 0 
x_4 = 0                x_10 = 0.250267079    x_16 = 0 
x_5 = [illegible]      x_11 = 0.392736833 
x_6 = 0.203628174      x_12 = 0.351412474 
KALL AND KELLER'S COMPLETE RECOURSE PROBLEMS 


Reference 


E. Keller, “GENSLP: A program for generating input for stochastic linear 
programs with complete fixed recourse”. Manuscript, Institut für Operations 
Research der Universität Zürich, Zürich, CH-8006. 


This is a computer program which generates random general recourse 
problems, and is available on the computer tape to be distributed by IIASA. 
The program was written by E. Keller under the direction of P. Kall at the 
Institute for Operations Research, University of Zürich. 

The format of the general recourse problem they pose is as follows: 
choose $x \in R^n$ to minimize 

\[ c^T x + E\{ \min_y q^T y \} \]

subject to 

\[ x \ge 0, \quad y \ge 0 \]
\[ Ax = b \]
\[ Tx + Wy = h \]

where A, b, c, q, W are deterministic matrices or vectors of appropriate 
dimension, and T, h are random of the following form: 

\[ T = T^0 + r_1 T^1 + \dots + r_k T^k \]
\[ h = h^0 + r_1 h^1 + \dots + r_k h^k \]

Here $(r_1,\dots,r_k)$ is a random vector and $(T^0,\dots,T^k)$, $(h^0,\dots,h^k)$ are fixed. 
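
To make the affine dependence on $(r_1,\dots,r_k)$ concrete, here is a minimal Python sketch that builds one realization of T and h from fixed matrices. It is only an illustration: the function name `affine_realization`, the 2 x 2 data and the uniform distribution of the $r_i$ are all assumptions made here, and none of them is taken from GENSLP.

```python
import random


def affine_realization(base, directions, r):
    """Return base + sum_i r[i] * directions[i], entrywise, for matrices given as nested lists."""
    result = [row[:] for row in base]
    for r_i, direction in zip(r, directions):
        for a, row in enumerate(direction):
            for b, value in enumerate(row):
                result[a][b] += r_i * value
    return result


# Hypothetical data with k = 2 random coefficients.
T0 = [[1.0, 0.0], [0.0, 1.0]]
T_dirs = [[[0.1, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.2]]]
h0 = [[5.0], [3.0]]
h_dirs = [[[1.0], [0.0]], [[0.0], [1.0]]]

rng = random.Random(0)
r = [rng.uniform(-1.0, 1.0) for _ in T_dirs]  # one draw of (r_1, r_2); distribution assumed
T = affine_realization(T0, T_dirs, r)
h = affine_realization(h0, h_dirs, r)
print(T, h)
```

Note that the same draw of r enters both T and h, which is what couples the random technology matrix to the random right-hand side in this format.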

Keller’s computer program generates the data $A, b, c, q, W, h^0,\dots,h^k$, 
$T^0,\dots,T^k$ randomly with appropriate checks for feasibility of A and W (for 
complete recourse). P. Kall has conducted a series of tests using the problems 
generated by GENSLP, to appear in 1985. 
PROJECT SCHEDULING PROBLEM 


Reference 


H.J. Cleef, W. Gaul, “Project scheduling via stochastic programming”, 
Math. Operationsforsch. Statist., Ser. Optimization 18 (1982), 449-468. 


We are given a directed-graph representation of a project where arcs rep- 
resent activities and nodes represent points at which choices between various 
activities must be made. The problem is to choose the length of completion 
time for each activity so that the total time consumed is less than some pre- 
specified limit and the project cost is minimized. Activity completion times are 
subject to lower and upper bounds, and costs increase as completion time is 
lowered. In some cases the activity completion times are only estimates, and 
there are recourse costs once the true completion time is known: if completion 
time is too short there are costs associated with obtaining additional resources; 
if completion time is too generous then there are gains associated with resources 
freed to work elsewhere. This problem will be available on the IIASA stochastic 
programming computer tape. 


Stochastic program with simple recourse 


Choose $x_j$ $(j = 1,\dots,25)$ to minimize 

\[ 2805.0 - \sum_{j=1}^{25} c_j x_j + E\Big\{ \sum_{j=1}^{25} \big[ q_j^+ y_j^+ + q_j^- y_j^- \big] \Big\} \]

subject to 

\[ x_j \le \sum_{k=1}^{15} e_{jk} \pi_k, \quad j = 1,\dots,25 \]
\[ \pi_{15} - \pi_1 \le \tau \]
\[ x_j + y_j^- - y_j^+ = \xi_j, \quad j = 1,\dots,25 \]
\[ \ell_j \le x_j \le u_j, \quad y_j^+, y_j^- \ge 0, \quad j = 1,\dots,25 \]

$x_j$: the scheduled length of time to complete task $j = 1,\dots,25$ 
$e_{jk}$: the node-arc incidence matrix 
$\pi_k$: the scheduled time at which decision node $k = 1,\dots,15$ is reached 
$\tau$: specified total time for project 
$c_j$: cost of completing scheduled task $j$ in one unit less time 
$\ell_j, u_j$: lower/upper bounds on time to complete task $j$ 
$\xi_j$: actual completion time for task $j$ 
$y_j^+$: excess time scheduled over actual time for completion of task $j$ 
$y_j^-$: deficit of scheduled time over actual time for task $j$ 
$q_j^+$: per unit value of excess time for task $j$ 
$q_j^-$: per unit value of deficit time for task $j$ 
Data 


node-arc incidence matrix $e_{jk}$ (arcs $j = 1,\dots,25$ by nodes $k = 1,\dots,15$): [entries illegible] 

c = [-2, -5, -3, -10, 0, -9, -7, -4, 8, -1, 12, 14, -3, 18, -1, -9, 10, -4, 0, -9, -11, 
-3, -5, 1, 1] 

$\ell$ = [1, 1, 2, 1, 0, 3, 2, 3, 6, 1, 5, 3, 1, 6, 3, 1, 6, 2, 0, 1, 2, 3, 1, 2, 3] 

u = [31, 31, 31, 31, 0, 31, 31, 31, 12, 31, 13, 12, 31, 24, 31, 31, 18, 31, 0, 31, 
31, 31, 31, 31, 31] 

$\tau$ = 30 

$q^-$ (deficit time) = [4, 6, 5, 12, 0, 12, 9, 6, 0, 6, 0, 0, 14, 0, 5, 13, 0, 7, 0, 12, 13, 6, 7, 3, 3] 

$q^+$ (excess time) = [-3, -5, -4, -11, 0, -10, -7, -5, 0, -2, 0, 0, -4, 0, -3, -10, 0, -5, 0, -10, -12, 
-4, -6, -2, -2] 


Note: the completion times for arcs $j$ = 5, 9, 11, 12, 14, 17, 19 are deterministic, 
hence the recourse penalties are 0. 
The random completion times are independent, discretely distributed: 


$\xi_j$   value              probability 

 1    3, 5, 10, 12, 20    0.2, 0.3, 0.3, 0.15, 0.05 
 2    3, 5, 10, 13, 20    0.2, 0.3, 0.3, 0.15, 0.05 
 3    4, 6, 8, 10, 12     0.15, 0.25, 0.25, 0.2, 0.15 
 4    2, 3, 5, 6, 7       0.1, 0.2, 0.5, 0.1, 0.1 
 5    deterministic (dummy) 
 6    6, 9, 15, 20, 25    0.175, 0.55, 0.2, 0.025, 0.05 
 7    6, 7, 8, 12, 18     0.15, 0.075, 0.3, 0.3, 0.175 
 8    6, 9, 15, 20, 25    0.175, 0.55, 0.2, 0.025, 0.05 
 9    deterministic 
10    2, 3, 5, 6, 7       0.1, 0.2, 0.5, 0.1, 0.1 
11    deterministic 
12    deterministic 
13    3, 5, 10, 13, 20    0.2, 0.3, 0.3, 0.15, 0.05 
14    deterministic 
15    6, 9, 15, 20, 25    0.175, 0.55, 0.2, 0.025, 0.05 
16    3, 5, 10, 13, 20    0.2, 0.3, 0.3, 0.15, 0.05 
17    deterministic 
18    4, 6, 8, 10, 12     0.15, 0.25, 0.25, 0.2, 0.15 
19    deterministic (dummy) 
20    2, 3, 5, 6, 7       0.1, 0.2, 0.5, 0.1, 0.1 
21    6, 7, 8, 12, 18     0.175, 0.55, 0.2, 0.025, 0.05 
22    6, 9, 15, 20, 25    0.175, 0.55, 0.2, 0.025, 0.05 
23    2, 3, 5, 6, 7       0.1, 0.2, 0.5, 0.1, 0.1 
24    4, 6, 8, 10, 12     0.15, 0.25, 0.25, 0.2, 0.15 
25    6, 9, 15, 20, 25    0.175, 0.55, 0.2, 0.025, 0.05 
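Solution 

optimal value: 2208 

optimal solution: 

$\pi_1$ = [illegible]    $\pi_6$ = 12             $\pi_{11}$ = 18 
$\pi_2$ = [illegible]    $\pi_7$ = [illegible]    $\pi_{12}$ = 29 
$\pi_3$ = 3              $\pi_8$ = 18             $\pi_{13}$ = 28 
$\pi_4$ = 6              $\pi_9$ = 19             $\pi_{14}$ = 27 
$\pi_5$ = [illegible]    $\pi_{10}$ = 18          $\pi_{15}$ = 30 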
FINANCIAL PLANNING MODEL 


Reference 


J.G. Kallberg, R.W. White, W.T. Ziemba, “Short term financial planning 
under uncertainty”. Management Science 28(6) (1982), 670-682. 


A multiperiod simple recourse model with discrete probability models. 
Many variations are solved in the article; however, some of the data are not 
given there. 


The full description may be found in the reference; a brief sketch follows. 

A firm must adjust its portfolio of short term assets and liabilities to min- 
imize the net cost of cash surpluses and deficits over a fixed planning horizon. 
Uncertainties arise in the firm’s need for cash, as well as in certain transaction 
costs. (These are modeled as discrete random variables.) The format is as 
follows: 


choose $x_{jt}$ $(j = 1,\dots,14,\ t = 1,\dots,4)$ to minimize: 


4 14 4 3 
Dd eiteie +E{>, Dd (adevee + ceevee)} 
t=1j=1 t=1 @&1 
14 
+) (djfost +4505) 
j=l 


subject to: 


14 14 
Dd exerie +> Sjt-12j,t-1 = 9 t=1,. 0254 
j=1 j=l 


14 
Yee —vee= > teewye — bye €=1,2,38, t=1,...,4 
j=l 
+ 


v; Us = 2a 8; j=l,...,14 
PRODUCT MIX PROBLEM 


This problem is a stochastic version of a linear program derived in [1, p. 50]. It 
is an example of the use of a random technology matrix. The random version 
of this problem is due to R.J-B. Wets. 

A furniture shop has 6000 man-hours available in the carpentry shop and 
4000 man-hours in the finishing shop per period. All employees are on salary, 
however, and the actual man-hours available are assumed to be normally dis- 
tributed random variables with deficits resulting from employee absences and 
surpluses due to voluntary overtime. There are four classes of products each 
consuming a certain number of man-hours in carpentry and finishing; the ac- 
tual time consumed is assumed to be a uniformly distributed random variable. 
Each product earns a certain profit per item, and the shop has the option to 
purchase casual labor from outside. Note that the cost of the salaried labor is 
fixed, and thus does not enter the problem. 


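Stochastic program with simple recourse. 


Choose $x_j$ $(j = 1,\dots,4)$ to maximize 

\[ \sum_{j=1}^{4} c_j x_j - E\Big\{ \sum_{k=1}^{2} q_k v_k \Big\} \]

subject to 

\[ \sum_{j=1}^{4} t_{kj} x_j - v_k \le h_k, \quad k = 1, 2 \]
\[ x_j \ge 0, \quad j = 1,\dots,4; \qquad v_k \ge 0, \quad k = 1, 2 \]

$x_j$: amount of product $j$ produced 
$c_j$: profit per unit of product $j$ 
$v_k$: hours of casual labor required of type $k$ 
$q_k$: cost per hour for casual labor of type $k$ 
$t_{kj}$: hours required of type $k$ to produce product $j$ 
$h_k$: hours of salaried labor of type $k$ available. 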


Data: 

c = [12, 20, 18, 40] 

q = [5, 10] 

$h_1$: Normal, mean 6000, st. dev. 100 
$h_2$: Normal, mean 4000, st. dev. 50 


$t_{11}$ ~ U[3.5, 4.5], $t_{12}$ ~ U[8, 10], $t_{13}$ ~ U[6, 8], $t_{14}$ ~ U[9, 11] 
$t_{21}$ ~ U[0.8, 1.2], $t_{22}$ ~ U[0.8, 1.2], $t_{23}$ ~ U[2.5, 3.5], $t_{24}$ ~ U[36, 44] 
Solution 


This problem has been solved using a technique developed in [2] with computer 
codes developed at IIASA (see [3]). Here are results for a run where the random 
measures were approximated by empirical measures derived from Monte Carlo 
simulations. The accuracy is to within a duality gap of 0.1%. 


Number of samples: 1028 


Solution: x_1 = 1384.80 
          x_2 = 0.0 
          x_3 = 0.0 
          x_4 = 55.5370 
Optimal value: 17690.54 
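
The objective value for a fixed production plan can be estimated by straightforward sampling. The following Python sketch is a rough check and not the Lagrangian finite generation method used to obtain the solution above. It assumes that casual labor covers exactly the shortfall, i.e. $v_k = \max(0, \sum_j t_{kj} x_j - h_k)$, and that all random quantities are independent; the names `estimate_profit` and `x_reported`, the sample size and the seed are arbitrary choices made here.

```python
import random

c = [12, 20, 18, 40]                    # profit per unit of product j
q = [5, 10]                             # cost per hour of casual labor of type k
h_params = [(6000, 100), (4000, 50)]    # (mean, st. dev.) of salaried hours h_k
t_ranges = [                            # uniform ranges of t_kj, one row per labor type k
    [(3.5, 4.5), (8, 10), (6, 8), (9, 11)],
    [(0.8, 1.2), (0.8, 1.2), (2.5, 3.5), (36, 44)],
]


def estimate_profit(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of sum_j c_j x_j - E{sum_k q_k v_k} for a fixed plan x."""
    rng = random.Random(seed)
    revenue = sum(cj * xj for cj, xj in zip(c, x))
    casual_cost = 0.0
    for _ in range(n_samples):
        for k in range(2):
            hours_needed = sum(rng.uniform(lo, hi) * xj
                               for (lo, hi), xj in zip(t_ranges[k], x))
            hours_available = rng.gauss(*h_params[k])
            # Assumed recourse: casual labor covers exactly the shortfall.
            casual_cost += q[k] * max(hours_needed - hours_available, 0.0)
    return revenue - casual_cost / n_samples


x_reported = [1384.80, 0.0, 0.0, 55.5370]   # plan reported above
estimate = estimate_profit(x_reported)
print(round(estimate, 1))
```

An estimate of this kind carries sampling error, so it should only be expected to agree with the reported optimal value to within a fraction of a percent.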


Next we solve the problem where the measures are replaced by a discretization 
based on conditional expectations. This is the “lower bounding” scheme 
described in [4] and implemented in [3]; of course we obtain an upper bound 
here because this problem is a maximization. Here is the actual discretization 
of the measures. We divide the range of each random variable $\xi$ into 4 intervals 
$I_1, I_2, I_3, I_4$ of equal probability $p = 1/4$ and calculate $\xi_i = E\{\xi \mid I_i\}$, $i = 1,\dots,4$. 
Then each random variable $\xi$ is approximated by the discrete random variable 
taking values $\xi_1, \xi_2, \xi_3, \xi_4$, each with probability 1/4: 

$h_1$ = [5872.9331, 5967.49, 6032.51, 6127.0669] w.p. (0.25, 0.25, 0.25, 0.25) 
$h_2$ = [3936.4666, 3983.7450, 4016.2550, 4063.5334] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{11}$ = [3.625, 3.875, 4.125, 4.375] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{12}$ = [8.25, 8.75, 9.25, 9.75] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{13}$ = [6.25, 6.75, 7.25, 7.75] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{14}$ = [9.25, 9.75, 10.25, 10.75] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{21}$ = [0.85, 0.95, 1.05, 1.15] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{22}$ = [0.85, 0.95, 1.05, 1.15] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{23}$ = [2.625, 2.875, 3.125, 3.375] w.p. (0.25, 0.25, 0.25, 0.25) 
$t_{24}$ = [37.0, 39.0, 41.0, 43.0] w.p. (0.25, 0.25, 0.25, 0.25) 
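
The conditional means used in this discretization have closed forms for both families. The Python sketch below computes them for cells of equal probability; it is an illustration written for this note (the function names are not from the original codes) and uses the standard identity $E\{Z \mid a < Z < b\} = (\varphi(a) - \varphi(b)) / P(a < Z < b)$ for a standard normal Z with density $\varphi$.

```python
from statistics import NormalDist


def normal_cell_means(mu, sigma, n_cells=4):
    """Conditional means of N(mu, sigma^2) over n_cells intervals of equal probability."""
    std = NormalDist()
    edges = [std.inv_cdf(i / n_cells) for i in range(1, n_cells)]
    # Density at the cell boundaries; it vanishes at -infinity and +infinity.
    pdf_at = [0.0] + [std.pdf(z) for z in edges] + [0.0]
    # Each cell has probability 1/n_cells, so dividing by it multiplies by n_cells.
    return [mu + sigma * (pdf_at[i] - pdf_at[i + 1]) * n_cells for i in range(n_cells)]


def uniform_cell_means(lo, hi, n_cells=4):
    """Conditional means of U[lo, hi] over n_cells equal subintervals (their midpoints)."""
    width = (hi - lo) / n_cells
    return [lo + (i + 0.5) * width for i in range(n_cells)]


print([round(v, 2) for v in normal_cell_means(6000, 100)])
print(uniform_cell_means(3.5, 4.5))
```

For the uniform variables the cell means are simply the midpoints of the four subintervals, which is why the tabulated points for the $t_{kj}$ are equally spaced; the normal cell means computed this way should land close to the tabulated points for $h_1$ and $h_2$.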


The solution, again accurate to within 0.1% duality gap, is: 


x_1 = 1377.26 
x_2 = 0.0 
x_3 = 0.0 
x_4 = 55.8027 


Optimal value: 17715.03 
This problem will be available on the stochastic programming computer 
tape to be distributed by IIASA. 
Reference 


1. G. Dantzig: Linear Programming and Extensions, Princeton University 
Press, 1963. 

2. R.T. Rockafellar and R.J-B. Wets, “A Lagrangian finite generation tech- 
nique for solving linear-quadratic problems in stochastic programming,” in 
A. Prékopa and R. J-B. Wets, Stochastic Programming 1984 Mathematical 
Programming Study, North Holland (1985) 

3. A.J. King, “An implementation of the Lagrangian finite generation meth- 
od,” this volume. 

4. J. Birge and R.J-B. Wets, “Designing approximation schemes for stochastic 
optimization problems”, in A. Prékopa and R. J-B. Wets, Stochastic Pro- 
gramming 1984 Mathematical Programming Study, North Holland (1985) 


LAKE BALATON MODEL 


Reference 


L. Somlyódy and R.J-B. Wets, “Stochastic models for lake eutrophication 
management”. Collaborative Paper CP-85- , International Institute for 
Applied Systems Analysis, Laxenburg, Austria (1985). 


The problem is to choose an optimal level of investments in sewage treat- 
ment facilities so that expected deviations of pollutant concentration levels are 
minimized. The form of the problem is as follows: 


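choose $x_j$ $(j = 1,\dots,54)$ to minimize 

\[ E\Big\{ \sum_{k=1}^{4} q_k\, \theta\big[ e_k (v_k - \gamma_k) \big] \Big\} \]

subject to 

\[ 0 \le x_j \le 1, \quad j = 1,\dots,54 \]
\[ \sum_{j=1}^{54} a_{ij} x_j \le b_i, \quad i = 1,\dots,35 \]
\[ v_k = h_k - \sum_{j=1}^{54} t_{kj} x_j, \quad k = 1,\dots,4 \]
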
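where $\theta(\cdot)$ is a linear-quadratic penalty function 

\[
\theta(\tau) =
\begin{cases}
\tfrac{1}{2}\tau^2 & \text{if } 0 \le \tau \le 1 \\
\tau - \tfrac{1}{2} & \text{if } \tau > 1 \\
0 & \text{if } \tau < 0.
\end{cases}
\]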
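
The penalty function is easy to state in code. The following short Python sketch, written for this note, implements $\theta$ exactly as defined above; the pieces join continuously at $\tau = 0$ and $\tau = 1$, and the slope is also continuous there (0 at $\tau = 0$, 1 at $\tau = 1$).

```python
def theta(tau):
    """Linear-quadratic penalty: zero below 0, quadratic on [0, 1], linear with slope 1 above 1."""
    if tau < 0:
        return 0.0
    if tau <= 1:
        return 0.5 * tau * tau
    return tau - 0.5


for tau in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(tau, theta(tau))
```

Small deviations are therefore penalized quadratically and large ones only linearly, which keeps the objective convex while limiting the influence of extreme realizations.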
This problem has been solved using the Lagrangian finite generation technique, 
see [1] and [2]. Details of the problem are available from the above authors. 


1. R.T. Rockafellar and R.J-B. Wets, “A Lagrangian finite generation tech- 
nique for solving linear-quadratic problems in stochastic programming”. In 
A. Prékopa and R. J-B. Wets Stochastic Programming 1984 Mathematical 
Programming Study, North Holland 1985. 


2. A.J. King, “An implementation of the Lagrangian finite generation tech- 
nique”, this volume. 


MULTIPERIOD PRODUCTION PLANNING 


Reference 


R.J. Peters, K. Boskma, and H.A.E. Kuper, “Stochastic programming in 
production planning: a case with non-simple recourse,” Statistica Neer- 
landica 31 (1977), 113-126. 


A factory must decide upon a production schedule and plan increases/decreases 
in production activity over several periods in order to meet a random 
demand for its mix of products. There are costs associated with changes in 
activity from one period to the next. The factory may engage in recourse activ- 
ities, buying product to cover shortages and storing product to cover surpluses. 
A surplus in one period is carried over to the next period thus imparting a 
general recourse feature to this problem. 
Multistage general recourse problem 
Choose x, u to minimize 

\[
\sum_{t=1}^{4} \Big[ \sum_{j=1}^{2} c_j x_{jt} + \sum_{i=1}^{3} d_i u_{it} \Big]
+ \sum_{t=1}^{3} \big[ e^+ w_t^+ + e^- w_t^- \big]
+ E\Big\{ \min \sum_{t=1}^{4} \sum_{j=1}^{2} \big[ q_{jt}^+ y_{jt}^+ + q_{jt}^- y_{jt}^- \big] \Big\}
\]

subject to: 

\[
\begin{aligned}
x_{jt} &\ge 0, && j = 1, 2, \quad t = 1,\dots,4 \\
u_{it} &\ge 0, && i = 1, 2, 3, \quad t = 1,\dots,4 \\
w_t^+, w_t^- &\ge 0, && t = 1, 2, 3 \\
\sum_{j=1}^{2} a_{ij} x_{jt} - u_{it} &\le b_{it}, && i = 1, 2, 3, \quad t = 1,\dots,4 \\
u_{it} &\le s_{it}, && i = 1, 2, 3, \quad t = 1,\dots,4 \\
\sum_{j=1}^{2} a_{3j} \big[ x_{j,t+1} - x_{jt} \big] &= w_t^+ - w_t^-, && t = 1, 2, 3 \\
y_{jt}^+ \ge 0, \quad y_{jt}^- &\ge 0, && j = 1, 2, \quad t = 1,\dots,4 \\
x_{j1} + y_{j1}^+ - y_{j1}^- &= \xi_{j1}, && j = 1, 2 \\
x_{jt} + y_{j,t-1}^- + y_{jt}^+ - y_{jt}^- &= \xi_{jt}, && j = 1, 2, \quad t = 2, 3, 4
\end{aligned}
\]

$x_{jt}$: amount of product $j = 1, 2$ produced in period $t = 1,\dots,4$ 
$\xi_{jt}$: demand for product $j$ in period $t$ 
$y_{jt}^+$: amount of deficit product $j$ purchased in period $t$ 
$y_{jt}^-$: amount of surplus product $j$ stored in period $t$ 
$c_j$: cost of producing product $j$ 
$q_{jt}^+, q_{jt}^-$: cost of deficit/surplus product $j$ in period $t$ 
$u_{it}$: extra capacity of production activity $i = 1, 2, 3$ used in period $t$ 
$b_{it}$: normal capacity of production activity $i$ in period $t$ 
$s_{it}$: maximum expansion of capacity for activity $i$ in period $t$ 
$d_i$: cost of extra capacity of production activity $i$ 
$w_t^+, w_t^-$: change in utilization of production activity 3 from period 
$t = 1, 2, 3$ to period $t + 1 = 2, 3, 4$ 
$e^+, e^-$: cost of change of production activity 3. 
Data: 
           t =   1      2      3      4 
b:     i=1     4000   4000   4000   3500 
       i=2     3000   3000   2500   3000 
       i=3     4500   4500   3750   3500 
s:     i=1      400    400    400    350 
       i=2      300    300    250    350 
       i=3      450    450    375    350 

a:     4  5 
       3  6           c: (100, 150) 
       3  7 
d: (15, 20, 10)       e+ = 20, e- = 15 

           t =   1      2      3      4 
q-:    j=1       25     25     25    100 
       j=2       30     30     30    150 
q+:    j=1      400    400    400    400 
       j=2      450    450    450    450 


The demands $\xi_{jt}$, $j = 1, 2$, $t = 1,\dots,4$, are independent normal (mean, 
standard deviation): 


Period    Product 1    Product 2 
1         (300, 45)    (500, 75) 
2         (320, 45)    (500, 75) 
3         (440, 45)    (500, 75) 
4         (480, 45)    (600, 75) 
Solution: 
           t =    1        2        3        4 
x:     j=1     341.35   304.56   493.22   401.96 
       j=2     560.85   576.62   377.91   377.73 
u:     i=1     170.60   115.80     7.32    27.95 
       i=2       0.       0.       0.       0. 
       i=3     450.0    450.0    375.0    350.0 
w+:              0.       0.       0. 
w-:              0.     825.     275. 
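
The reported figures for production activity 3 can be checked against the capacity and smoothing constraints with a few lines of arithmetic. The Python sketch below is such a check, written for this note; it assumes that the third row of the matrix a is (3, 7), as read from the data table, and the variable names are illustrative.

```python
a3 = (3, 7)                                   # row of a for activity 3 (as read from the data)
x = [(341.35, 560.85), (304.56, 576.62),      # (x_1t, x_2t) for t = 1,...,4
     (493.22, 377.91), (401.96, 377.73)]
b3 = [4500, 4500, 3750, 3500]                 # normal capacity of activity 3
u3 = [450.0, 450.0, 375.0, 350.0]             # extra capacity used
w_plus = [0.0, 0.0, 0.0]
w_minus = [0.0, 825.0, 275.0]

# Utilization of activity 3 in each period: a_31 x_1t + a_32 x_2t.
usage = [a3[0] * x1 + a3[1] * x2 for x1, x2 in x]

# Slack in the capacity constraint: b_3t + u_3t - usage_t.
capacity_gap = [b + u - use for b, u, use in zip(b3, u3, usage)]

# Period-to-period change in utilization, to compare with w_t^+ - w_t^-.
changes = [usage[t + 1] - usage[t] for t in range(3)]

for t in range(4):
    print(t + 1, round(usage[t], 2), round(capacity_gap[t], 2))
print([round(ch, 2) for ch in changes],
      [wp - wm for wp, wm in zip(w_plus, w_minus)])
```

With these readings activity 3 runs at its fully expanded capacity in every period, and the changes in utilization agree with the reported $w_t^+ - w_t^-$ up to rounding of the printed solution.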
MANPOWER PLANNING 


Reference 


E.P.C. Kao, M. Queyranne: Aggregation in a Two-Stage Stochastic Pro- 
gram for Manpower Planning in the Service Sector, Working Paper, Center 
for Health Management, University of Houston, 1981. 


An employer must decide upon a base level of regular staff at various 
skill levels. The recourse actions available are regular staff overtime or outside 
temporary help in order to meet unknown demand for services at minimum 
cost. 


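Multistage general recourse problem. 


Choose $x_j$ $(j = 1, 2, 3)$ to minimize 

\[ \sum_{j=1}^{3} c_j x_j + E\Big\{ \min \sum_{t=1}^{12} \sum_{j=1}^{3} \big[ q_j y_{jt} + r_j z_{jt} \big] \Big\} \]

subject to: 

\[
\begin{aligned}
x_j &\ge 0 \\
y_{jt} \ge 0, \quad z_{jt} &\ge 0 \\
\sum_{j=1}^{3} \big[ y_{jt} + z_{jt} \big] &\ge \xi_t - \alpha_t \sum_{j=1}^{3} x_j, && t = 1,\dots,12 \\
y_{jt} &\le 0.2\, \alpha_t x_j, && j = 1, 2, 3, \quad t = 1,\dots,12 \\
\gamma_{j-1} \big[ x_{j-1} + y_{j-1,t} + z_{j-1,t} \big] - \big[ x_j + y_{jt} + z_{jt} \big] &\ge 0, && j = 2, 3, \quad t = 1,\dots,12
\end{aligned}
\]

$x_j$: base level of regular staff at skill level $j = 1, 2, 3$ 
$y_{jt}$: amount of overtime help 
$z_{jt}$: amount of temporary help 
$c_j$: cost of regular staff at skill level $j = 1, 2, 3$ 
$q_j$: cost of overtime 
$r_j$: cost of temporary 
$\xi_t$: demand for services 
$\alpha_t$: anticipated absentee rate for regular staff at time $t = 1,\dots,12$ 
$\gamma_{j-1}$: ratio of amount of skill level $j$ per amount of skill level $j - 1$ required. 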


Data 

c = [7.03, 4.53, 3.44] 

q = [9.59, 6.18, 4.69] 

r = [11.70, 9.95, 5.78] 

$\alpha$ = [.8943, .8917, .8948, .9086, .9032, .8842, .8513, .8798, .8871, .9043, .8606, .8341] 

$\gamma$ = [0.6, 0.2] 

The demands $\xi_t$, $t = 1,\dots,12$, are independent $N(\mu_t, \sigma_t^2)$ where $\sigma_t^2 = \ell \mu_t$. 

$\mu$ = [11975, 11740, 12169, 13132, 13525, 12598, 13503, 14168, 12602, 11807, 11334, 
10410] 


Solution 


The problem is not solved. Instead the author has worked out upper and 
lower bounds for the objective function corresponding to various values of $\ell$ 
(i.e. changing the variance of the random demand). 


$\ell$     Upper Bound    Lower Bound    Difference        Relative Gap 
           UB             LB             Δ = UB - LB       (Δ/LB) x 100% 
0          852,230        846,706        5524              0.65 
1          852,997        846,287        6690              0.79 
10         855,539        851,458        4081              0.48 
30         859,505        854,706        4799              0.56 


FLOOD CONTROL PROBLEM 


Reference 


A. Prékopa, T. Szántai: “Flood control reservoir system design using 
stochastic programming.” Mathematical Programming Study 9, North-Holland, 
1978, 128-151. 


The object is to choose the optimal size of reservoirs placed at certain 
(fixed) locations in order to control flooding due to random stream inputs. The 
criterion is to find the lowest cost solution which controls floods a given per- 
centage of the time. The probability model for the stream flows is multivariate 
normal and gamma. 
Chance constrained problem 


Choose $x_j$ $(j = 1,\dots,5)$ to minimize: 

\[ \sum_{j=1}^{5} c_j x_j \]

subject to 

\[ 0 \le x_j \le u_j, \quad j = 1,\dots,5 \]
\[ P\Big[ \sum_{j=1}^{5} t_{kj} x_j \ge \sum_{j=1}^{5} t_{kj} \xi_j, \ k = 1,\dots,9 \Big] \ge p \]

$x_j$: capacity of reservoir $j = 1,\dots,5$ 
$u_j$: upper bound on capacity of reservoir $j$ 
$c_j$: cost per unit capacity of reservoir $j$ 
$\xi_j$: streamflow for tributary $j$, $j = 1,\dots,5$ 
(The system of conditions $T\xi \le Tx$ is equivalent to the condition that the net 
flow volume be less than the capacity of the furthest downstream reservoir.) 


Data 
u = [1.0, 1.0, 1.0, 2.0, 3.0] 
c = [0.4, 0.5, 0.6, 1.2, 1.8] 

T = $(t_{kj})$: 9 x 5 matrix [entries illegible] 


For the random variables $\xi_j$, $j = 1,\dots,5$, we specify: 


           means    st. dev. 

$\xi_1$    0.8      0.2 
$\xi_2$    1.5      0.3 
$\xi_3$    1.2      0.6 
$\xi_4$    0.5      0.4 
$\xi_5$    0.7      0.3 
Solution 
The problem was solved using three different correlation matrices for the 
random variables $\xi_1,\dots,\xi_5$ (but with the same means and variances given above): 

\[
R_1 = \begin{pmatrix}
1.0 & 0.0 & 0.6 & 0.4 & 0.0 \\
0.0 & 1.0 & 0.5 & 0.3 & 0.3 \\
0.6 & 0.5 & 1.0 & 0.7 & 0.6 \\
0.4 & 0.3 & 0.7 & 1.0 & 0.4 \\
0.0 & 0.3 & 0.6 & 0.4 & 1.0
\end{pmatrix},
\]

$R_2$, of which only rows 1, 3 and 5 are legible (rows 2 and 4 are [illegible]): 

row 1:    1.0   -0.5    0.0    0.3   -0.5 
row 3:    0.0   -0.8    1.0    0.0    0.3 
row 5:   -0.5    0.2    0.3    0.0    1.0 

$R_3 = E$, the identity matrix, 


and two probability levels p = 0.8, p = 0.9. The following table gives the 
solutions: 


Numerical Results 

Type of         Correlation   Probability                                       Objective   Computing 
distribution    matrix        level         X1      X2   X3   X4      X5       function    time* 

Multivariate    R1            p=0.8         0.807   1    1    1.356   1.412    5.591       00:52:657 
gamma                         p=0.9         0.751   1    1    1.976   1.398    6.289       00:35:688 
                R3            p=0.8         1       1    1    1.539   1.193    5.494       00:16:785 
                              p=0.9         1       1    1    1.268   1.848    6.348       00:11:343 

Multivariate    R1            p=0.8         0.796   1    1    1.591   1.383    5.816       01:03:44 
normal                        p=0.9         0.998   1    1    1.885   1.524    6.505       00:25:126 
                R2            p=0.8         0.906   1    1    1.351   1.371    5.551       00:58:078 
                              p=0.9         0.833   1    1    1.239   1.880    6.214       00:51:426 
                R3            p=0.8         1       1    1    1.226   1.431    5.547       00:43:461 
                              p=0.9         1       1    1    1.650   1.374    5.953       00:57:478 

* Time in minutes/seconds/milliseconds. 
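
Since the objective is linear, each tabulated objective value can be recomputed from the tabulated capacities as $\sum_j c_j x_j$. The Python sketch below does this for the four multivariate gamma rows; it was written for this note, the capacities are as read from the table (with $X_2 = X_3 = 1$ in every row), and the names are illustrative.

```python
c = [0.4, 0.5, 0.6, 1.2, 1.8]   # cost per unit capacity of reservoirs 1..5

# (capacities x_1..x_5, reported objective value) for the multivariate gamma runs.
gamma_rows = [
    ([0.807, 1, 1, 1.356, 1.412], 5.591),
    ([0.751, 1, 1, 1.976, 1.398], 6.289),
    ([1, 1, 1, 1.539, 1.193], 5.494),
    ([1, 1, 1, 1.268, 1.848], 6.348),
]


def cost(x):
    """Total cost of reservoir capacities x."""
    return sum(cj * xj for cj, xj in zip(c, x))


for x, reported in gamma_rows:
    print(round(cost(x), 4), reported)
```

A check of this kind is a convenient way to catch transcription errors in a table, in keeping with the disclaimer in the introduction to this collection.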


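In the multivariate gamma case for $R_1$ we have 

\[
\begin{aligned}
\xi_1 &= a_1 (y_1 + y_2 + y_3), \\
\xi_2 &= a_2 (y_4 + y_5 + y_6 + y_7), \\
\xi_3 &= a_3 (y_1 + y_4 + y_5 + y_8 + y_9 + y_{10} + y_{11}), \\
\xi_4 &= a_4 (y_2 + y_6 + y_8 + y_9 + y_{12}), \\
\xi_5 &= a_5 (y_4 + y_8 + y_{10} + y_{13}),
\end{aligned}
\]

where $y_1,\dots,y_{13}$ have standard gamma distributions with the following 
parameters: 

$\vartheta_1$ = 0.576      $\vartheta_6$ = 0.225      $\vartheta_{10}$ = 0.050 
$\vartheta_2$ = 0.160      $\vartheta_7$ = 23.875     $\vartheta_{11}$ = 2.055 
$\vartheta_3$ = 15.264     $\vartheta_8$ = 0.140      $\vartheta_{12}$ = 0.758 
$\vartheta_4$ = 0.315      $\vartheta_9$ = 0.280      $\vartheta_{13}$ = 4.940 
$\vartheta_5$ = 0.585 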
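
The gamma construction can be checked against the stated means and standard deviations. A standard gamma variable with parameter $\vartheta$ has mean and variance both equal to $\vartheta$, so a scaled sum of independent such variables has (mean / st. dev.)$^2$ equal to the sum of the parameters involved. The Python sketch below performs this check; it was written for this note, and the index sets in `members` are as read from the partly illegible formulas above, so they should be treated as an assumption.

```python
shape = {1: 0.576, 2: 0.160, 3: 15.264, 4: 0.315, 5: 0.585, 6: 0.225, 7: 23.875,
         8: 0.140, 9: 0.280, 10: 0.050, 11: 2.055, 12: 0.758, 13: 4.940}

# Which y_i enter each xi_j (assumed reading of the printed formulas).
members = {
    1: [1, 2, 3],
    2: [4, 5, 6, 7],
    3: [1, 4, 5, 8, 9, 10, 11],
    4: [2, 6, 8, 9, 12],
    5: [4, 8, 10, 13],
}

# (mean, st. dev.) of xi_1..xi_5 from the data section.
moments = {1: (0.8, 0.2), 2: (1.5, 0.3), 3: (1.2, 0.6), 4: (0.5, 0.4), 5: (0.7, 0.3)}

shape_sum = {j: sum(shape[i] for i in idx) for j, idx in members.items()}
required = {j: (mean / sd) ** 2 for j, (mean, sd) in moments.items()}
scale = {j: sd ** 2 / mean for j, (mean, sd) in moments.items()}  # implied scale factor

for j in sorted(members):
    print(j, round(shape_sum[j], 3), round(required[j], 3), round(scale[j], 4))
```

Under this reading the summed parameters match the required values to about three decimal places for all five variables, which supports the index sets shown.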

STABIL 

Reference 


(1) A. Prékopa, S. Ganczer, I. Deák, K. Patyi, “The STABIL stochastic 
programming model and its experimental application to the electrical energy 
sector of the Hungarian economy”, in M. Dempster, ed., Stochastic 
Programming, Academic Press, 1980, 369-385. 

(2) A. Prékopa, S. Ganczer, I. Deák, K. Patyi, “A STABIL sztochasztikus 
programozási modell és annak kísérleti alkalmazása a magyar 
villamosenergia-iparra,” Alkalmazott Matematikai Lapok 1 (1975) 3-22 (in Hungarian) 


A large-scale chance constrained model with multi-variate normal and 
gamma distributions. A description of the model is in (2), with an edited 
version in (1). The format of the problem is as follows: 


choose $x_j$ $(j = 1,\dots,52)$ to minimize 

\[ \sum_{j=1}^{52} c_j x_j \]

subject to: 

\[ x_j \ge 0 \]

\[ \sum_{j=1}^{52} a_{ij} x_j \ge b_i, \quad i = 5,\dots,110 \]


52 
P{)_ aise; > 6; + 0:8, i =1,2,3,4} 2p. 
j=l 
GAS DELIVERY PROBLEM 


Reference 
J.-M. Guldmann: “Supply, storage, and service reliability decisions by gas 
distribution utilities: a chance-constrained approach.” Management Science 
29(8) (1983), 884-906. 


A gas delivery company has two options for gas supply: to purchase from 
a pipeline, or to withdraw gas from its own storage field. The demand for gas 
consumption is assumed random. The company must make three types of 
decisions: the maximum monthly contract (which commits the pipeline 
company to allocate this capacity to the delivery company), the increment to 
its storage capacity, and the actual monthly supply request from the pipeline 
company on a month to month basis. The contract decision and the storage 
capacity increment decision are made once at the beginning of each year. Any 
monthly surplus or deficit of pipeline supply vs. gas consumption is stored in 
or withdrawn from, respectively, the gas delivery company’s own storage 
field. The delivery company’s objective is to meet its consumption demand 
at minimum cost, subject to feasibility constraints on the operation of its gas 
storage facility. 


Chance constrained non-linear version. 
Choose $x_1,\dots,x_{12}$, $y$, $z$ to minimize 

\[ \sum_{t=1}^{12} c_t x_t + c_0 E\Big\{ \sum_{t=1}^{12} | x_t - \xi_t | \Big\} + c_{13} y + c_{14} z \]

subject to 

\[
P\left\{
\begin{aligned}
(x_t - \xi_t) - a_1 \sum_{s=1}^{t-1} (x_s - \xi_s) - a_2 z &\le b_1 \\
-(x_t - \xi_t) - a_3 \sum_{s=1}^{t-1} (x_s - \xi_s) - a_4 z &\le b_2 \\
\sum_{s=1}^{t} (x_s - \xi_s) - a_5 z &\le b_3 \\
\sum_{s=1}^{t} (x_s - \xi_s) &\ge 0
\end{aligned}
\qquad t = 1,\dots,12 \right\} \ge p
\]

$x_t$: gas ordered from pipeline in month $t = 1,\dots,12$ 
$y$: contract capacity per month 
$z$: gas storage capacity increment 
$\xi_t$: actual gas consumed in month $t = 1,\dots,12$ (random) 
$c_t$: cost per unit of gas ordered in month $t = 1,\dots,12$ 
$c_0$: cost of transferring one unit gas into or out of storage facility 
$c_{13}$: cost of contract capacity 
$c_{14}$: cost of storage facility capacity increment 
Data. 
$c_1,\dots,c_7$ = 1202.4      $c_8,\dots,c_{12}$ = 1299.3      $c_0$ = 33.23 
$c_{13}$ = 392.0              $c_{14}$ = 57.0 
$a_1$ = -0.078                $a_2$ = 0.8                      $b_1$ = 118075.2 
$a_3$ = 0.15                  $a_4$ = 0.049                    $b_2$ = 7232.1 
$a_5$ = 0.41                                                   $b_3$ = 60513.5 
$\xi_t$ $(t = 1,\dots,12)$ are independent, normal: 
$\xi_t$ = 14,900 + 36.5839 $\eta_t$, 
where $\eta_t$ are normal as follows: 
              mean     st. dev.                  mean     st. dev. 
$\eta_1$      506.6    90.5       $\eta_7$       371.6    91.1 
$\eta_2$      248.2    88.3       $\eta_8$       712.6    85.6 
$\eta_3$      50.5     28.8       $\eta_9$       1071.6   145.8 
$\eta_4$      11.0     9.4        $\eta_{10}$    1207.7   129.5 
$\eta_5$      18.9     14.1       $\eta_{11}$    1046.3   115.2 
$\eta_6$      120.5    42.1       $\eta_{12}$    892.5    125.4 
Solution. 


The author did not solve the given problem. The nonlinear term presents 
difficulties. To cope with this difficulty he specifies: 

\[ P\{ x_t \ge \xi_t, \ t = 1,\dots,7; \quad x_t \le \xi_t, \ t = 8,\dots,12 \} \ge 0.999 \]

in which case the nonlinear term is approximated by: 

\[ c_0 \Big[ \sum_{t=1}^{7} \big( x_t - E\{\xi_t\} \big) + \sum_{t=8}^{12} \big( E\{\xi_t\} - x_t \big) \Big]. \]


He then obtains the following results: 


                                  Reliability 
Month (t)          99%        95%        90%        85%        80% 


Monthly Supply $x_t$ (MMCF) 


April (1)          46,680     46,680     46,680     46,680     46,680 
May (2)            44,471     36,909     36,909     43,153     42,954 
June (3)           40,571     39,016     37,871     36,615     36,017 
July (4)           38,695     36,846     35,568     34,260     33,590 
August (5)         30,291     33,319     31,499     24,959     24,156 
September (6)      25,472     25,472     25,472     25,472     25,472 
October (7)        41,820     41,820     41,820     41,820     41,820 
November (8)       28,442     28,442     28,442     28,442     28,442 
December (9)       32,764     32,764     32,764     32,764     32,764 
January (10)       40,132     40,132     40,132     40,132     40,132 
February (11)      36,326     36,326     36,326     36,326     36,326 
March (12)         29,197     29,197     29,197     29,197     29,197 


Total              434,861    426,923    422,680    419,820    417,550 
Minimum Storage Capacity (MMCF) 
                   360,138    324,158    304,927    291,981    281,677 
Expected Purchases and Storage Operation Costs ($1000’s) 
                   564,093    554,284    549,042    545,509    542,704 

The minimum storage capacity is (in our notation) z + 147594. We refer the 
reader to the reference for more details of the author’s solution. 
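
The totals in the results table, and the expected annual consumption implied by the data, can be verified with a short calculation. The Python sketch below was written for this note; the variable names are illustrative, and the expected consumption uses only $E\{\xi_t\} = 14{,}900 + 36.5839\,E\{\eta_t\}$ from the data section.

```python
# Months whose supply is the same at every reliability level: t = 1, 6, 7, ..., 12.
common = [46680, 25472, 41820, 28442, 32764, 40132, 36326, 29197]

# Supply in May-August (t = 2,...,5) by reliability level.
varying = {
    "99%": [44471, 40571, 38695, 30291],
    "95%": [36909, 39016, 36846, 33319],
    "90%": [36909, 37871, 35568, 31499],
    "85%": [43153, 36615, 34260, 24959],
    "80%": [42954, 36017, 33590, 24156],
}
reported_total = {"99%": 434861, "95%": 426923, "90%": 422680, "85%": 419820, "80%": 417550}

computed_total = {level: sum(common) + sum(months) for level, months in varying.items()}

# Means of eta_1..eta_12 from the data section.
eta_mean = [506.6, 248.2, 50.5, 11.0, 18.9, 120.5, 371.6, 712.6, 1071.6, 1207.7, 1046.3, 892.5]
expected_annual_demand = sum(14900 + 36.5839 * m for m in eta_mean)

for level in varying:
    print(level, computed_total[level], reported_total[level])
print(round(expected_annual_demand, 1))
```

The monthly figures add up to the printed totals at every reliability level, and each total exceeds the expected annual consumption, with the margin growing as the required reliability increases.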